FAQs Preimputation (QC)

Why is Ricopili not populating the name in the [disease].names file?

If the dataset has previously been run through the Ricopili pipeline, Ricopili may recognize the file name format (disease_batch_popname_initials-qc.[bed,bim,fam]) as that of an already processed dataset, and so will not generate a dataset name for the [disease].names file.

Sex checks

This guidance is for circumstances where the analyst has doubts about the accuracy of the phenotype sex variable in the plink binary file. The Preimputation Module automatically does sex checks and exclusions. Problems to be aware of when interpreting the QC sex check output (page 3 of qc/disease_batch_popname_initials-qc.pdf) are noted next.

If the data have not been SNP-QCed, the sex check results can be inaccurate. Because Ricopili does sample QC before SNP QC, the sex check can generate misleading sex information in lower-quality data. Likewise, running Plink --check-sex on lower-quality data may also generate inaccurate sex information. A remedy is to SNP-QC the dataset first, which can fix this issue entirely. Manually SNP-QC the dataset in Plink with this command (plink --bfile infilename --mind 0.01 --geno 0.01 --maf 0.05 --make-bed --out outfilename), then LD prune the data (see Joni Coleman’s scripts to LD prune, see here for a list of high-LD regions in the GRCh37/hg19 assembly), and run Plink --check-sex on this cleaned dataset. This can help to generate accurate sex information. Note that it is preferable to use common variants, filtering on a MAF of 0.05, for sex calling: http://pubmedcentralcanada.ca/pmcc/articles/PMC5100670/. Also, Plink --check-sex requires fairly accurate minor allele frequencies; when the dataset is small, it is generally necessary to obtain these from another source (e.g., 1000 Genomes) and use --read-freq.
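The sequence described above (SNP QC, LD pruning, then the sex check) can be sketched as a list of Plink invocations. This is a minimal sketch, assuming PLINK 1.9 flag names. The function name, the output prefixes, the high-LD range file name, and the pruning parameters (50 5 0.2) are illustrative assumptions, not values taken from this guide or from Ricopili. The function only builds the commands; it does not run them.

```python
from typing import List, Optional


def sex_check_commands(
    in_prefix: str,
    out_prefix: str,
    high_ld_ranges: str = "high_ld_regions.txt",  # hypothetical file name
    freq_file: Optional[str] = None,
) -> List[List[str]]:
    """Build (but do not run) Plink commands: QC, LD prune, then --check-sex."""
    qc = f"{out_prefix}.qc"
    pruned = f"{out_prefix}.pruned"
    commands = [
        # 1. QC with the thresholds given in the text; --make-bed writes
        #    the filtered dataset to disk.
        ["plink", "--bfile", in_prefix, "--mind", "0.01", "--geno", "0.01",
         "--maf", "0.05", "--make-bed", "--out", qc],
        # 2. LD prune outside the high-LD regions. The window, step, and r^2
        #    values (50 5 0.2) are assumptions, not from this guide.
        ["plink", "--bfile", qc, "--exclude", "range", high_ld_ranges,
         "--indep-pairwise", "50", "5", "0.2", "--out", pruned],
        # 3. Sex check restricted to the pruned-in variants.
        ["plink", "--bfile", qc, "--extract", f"{pruned}.prune.in",
         "--check-sex", "--out", f"{out_prefix}.sexcheck"],
    ]
    if freq_file is not None:
        # For small datasets: take allele frequencies from an external source.
        commands[2] += ["--read-freq", freq_file]
    return commands


for command in sex_check_commands("infilename", "outfilename"):
    print(" ".join(command))
```

Each returned list can be passed to `subprocess.run(command, check=True)` once the file names have been adapted. Keeping the steps as data makes it easy to review the exact commands before anything is executed.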

If there are not enough X variants for sex calling, the sex checks may fail without you knowing. Check how many X variants there are in the Plink .bim file; you want at least 100 X variants to be able to call sex with Plink. In cases where the pattern of sex fails was unstable across repeated sex checks as data were processed through Ricopili, the reason was that there were not enough X variants. In these instances, you may see sex “jumping” around (e.g., many sex fails in one dataset during cleaning, then puzzlingly many again later in a cleaner version of the dataset). If you do not have enough X variants, SNP sex cannot be used to double-check the accuracy of phenotype sex or to determine the sex composition of the cohort. Beware that in these circumstances you will still get output on sex fails, but it will not be trustworthy.
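One way to check the X variant count is to read the first column of the .bim file. The sketch below assumes the X chromosome is coded as 23 or X (optionally with a chr prefix). The function names are illustrative, and the default threshold of 100 is the guideline figure from the paragraph above.

```python
# Chromosome codes treated as X. Code 25 / XY (the pseudo-autosomal
# region) is deliberately not counted.
X_CODES = {"23", "X"}


def count_x_variants(bim_path: str) -> int:
    """Count variants in a Plink .bim file that sit on the X chromosome."""
    count = 0
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            code = fields[0].upper()
            if code.startswith("CHR"):
                code = code[3:]
            if code in X_CODES:
                count += 1
    return count


def has_enough_x_variants(bim_path: str, minimum: int = 100) -> bool:
    """True if the .bim file has at least `minimum` X variants."""
    return count_x_variants(bim_path) >= minimum
```

For example, `has_enough_x_variants("infilename.bim")` returning False would suggest that the sex check output for that dataset should not be relied on.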

How do I combine case and control datasets from different sources?

Please go through this document which describes this process in detail.

Dataset with Several Ethnicities

If your dataset contains different ethnicities, split the PLINK files by ancestry group and create a separate working directory in which to QC each resulting set of files.
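The split can be sketched as follows, assuming ancestry labels have already been assigned to each sample by some other means (how to assign them is outside the scope of this FAQ). The function names, the directory layout, and the output prefixes are hypothetical. `--keep` and `--make-bed` are standard PLINK flags.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List, Tuple


def write_keep_files(
    assignments: Dict[Tuple[str, str], str], out_dir: str
) -> Dict[str, Path]:
    """Write one PLINK --keep file (FID, IID per line) per ancestry label,
    each inside its own working directory under out_dir."""
    groups = defaultdict(list)
    for (fid, iid), label in assignments.items():
        groups[label].append((fid, iid))

    keep_files = {}
    for label, samples in sorted(groups.items()):
        work_dir = Path(out_dir) / label
        work_dir.mkdir(parents=True, exist_ok=True)
        keep_path = work_dir / f"{label}.keep"
        keep_path.write_text("".join(f"{fid}\t{iid}\n" for fid, iid in samples))
        keep_files[label] = keep_path
    return keep_files


def split_command(in_prefix: str, label: str, keep_path: Path) -> List[str]:
    """Build (but do not run) the PLINK command that extracts one group."""
    out_prefix = keep_path.parent / label  # placeholder output prefix
    return ["plink", "--bfile", in_prefix, "--keep", str(keep_path),
            "--make-bed", "--out", str(out_prefix)]
```

The resulting file sets would still need to be renamed to whatever naming scheme the pipeline expects before QC is run in each working directory.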

The Preimputation Module has stopped, but I don’t see any errors. My log file is telling me it stopped at the “qc” step. What has happened?

Sometimes Ricopili gets stuck right before the finish (i.e., it is no longer in the job queue but has not finished, and the last step it started according to the log files was the ‘qc’ step). If troubleshooting turns up no obvious error messages, resubmit the preimputation command. On resubmission, Ricopili usually completes within a few seconds and a success message appears in the terminal window.