Preimputation (QC)

Overview

It's highly recommend to go through RICOPILI tutorial before using this module. This also makes available simulated data for testing the pipeline.

This module performs SNP and Sample quality control (QC) of multiple datasets in parallel.

Throughout these examples we will be using real datasets. If you are a Psychiatric Genomics Consortium (PGC) analyst in training, you can access these datasets via the Data Access Committee. Contact hunna_watson@med.unc.edu for information on how to apply for these data. Then you can emulate the examples while following along with the tutorials. If you are external to the PGC but are interested in Ricopili, we welcome you to this tutorial. You will have the advantage of looking at analysis results based on real data, since artificially created (simulated) datasets do not emulate well the linkage disequilibrium and other properties of real human data. It is our first priority to present a tutorial based upon real data. We are looking into creating some data files that will be freely downloadable by all in the future, but for now, because of time constraints and consent issues, use of the datasets in this tutorial requires authorization.

Input Requirements

  1. Binary PLINK files (bed/bim/fam), multiple datasets in working directory are allowed

  2. Phenotypes coded as 1 (control) or 2 (case). Missing phenotypes are not allowed right now (can be changed downstream)

  3. Allele names A,C,G,T

  4. Genome build hg16, hg17, hg18 or hg19

Technical Details

The default QC parameters used by Ricopili, are in order below. It is possible to change the QC thresholds using optional commands.

SNP QC: call rate 0.95 (this criterion is useful when merging case and control datasets from different studies)

Sample QC: call rate in cases or controls 0.98

Sample QC: FHET within +/- 0.20 in cases or controls

Sample QC: Sex violations (excluded) - genetic sex does not match pedigree sex

Sample QC: Sex warnings (not excluded) - undefined phenotype / ambiguous genotypes

SNP QC: call rate 0.98

SNP QC: missing difference 0.02

SNP QC: SNPs with no valid association p value are excluded (i.e., invariant SNP)

SNP QC: with MAF 0.01

SNP QC: Hardy-Weinberg equilibrium (HWE) in controls p value 1e-06 (i.e., log 10(p) of -6, which is the units Ricopili expresses HWE in)

SNP QC: Hardy-Weinberg equilibrium (HWE) in cases p value 1e-10 (i.e., log 10(p) of -10, which is the unites Ricopili expresses HWE in)

Usage Instructions

1. Make a brand-new directory that is your working directory for the project

For example, make a directory called RicopiliTraining/ that contains all of the Ricopili output for a schizophrenia project.

mkdir ~/RicopiliTraining/preimputation/

Then make a directory for the first cohort in the project that you are going to QC.

mkdir ~/RicopiliTraining/preimputation/aber

2. Change into the new working directory for the first cohort

cd ~/RicopiliTraining/preimputation/aber

3. Copy (or make symbolic links) to all input PLINK files you want to QC into this directory

cd ~/RicopiliTraining

cp ~/RicopiliTrainingData/scz_aber_eur-qc.fam ~/RicopiliTraining/aber.fam

cp ~/RicopiliTrainingData/scz_aber_eur-qc.bim ~/RicopiliTraining/aber.bim

cp ~/RicopiliTrainingData/scz_aber_eur-qc.bed ~/RicopiliTraining/aber.fam

4. Run checks on the data

Check the coding of variables is at it should be (.fam file, column 5: male=1, female=2; .fam file, column 6: control=1, case=2) and what you expect based on your knowledge of the data. Gather a count of sample and SNP N in the files.

data=aber

head *fam

echo "n total" && wc -l $data.fam

echo "n Cases" && awk '$6==2' $data.fam | wc -l

echo "n Controls" && awk '$6==1' $data.fam | wc -l

echo "n female cases" && awk '$5==2 && $6==2' $data.fam | wc -l

echo "n male cases" && awk '$5==1 && $6==2' $data.fam | wc -l

echo "n female controls" && awk '$5==2 && $6==1' $data.fam | wc -l

echo "n male controls" && awk '$5==1 && $6==1' $data.fam | wc -l

echo "n SNPs" && wc -l $data.bim

This is the output from our example. All looks okay:

5. Run the Ricopili pre-imputation module with the command below. I have added an optional command, an MAF filter of 0.01, to include only common variants (--maf 0.01), but this can be left off.

preimp_dir --dis scz --pop eur --out aber --maf 0.01

--dis is a required option that must be a 3 letter abbreviation for your phenotype. In this example, scz stands for schizophrenia. For example disease abbreviations, see here.

--pop stands for the ancestry of the samples in your dataset. For example, "eur" stands for European. There's also "asn" for Asian, "aam" for African-American, "afr" for African, "his" for Hispanic.

* this is used only for naming conventions and will have no effect on genotypes / SNP selections. you are free to use any name, but it's helpful to restrict on these names here. if unclear use "mix".

--out is a required option that is the output name for the project (only used to identify this pipeline run in the log files and in the queue-jobs).

To find out what the latest version is, go to the source code repository (here), look for the file with the corresponding name, note the version number, and write it into the command line i.e., "preimp_dir_12"

For other command options, see here.

Troubleshooting: If you get an error message suggesting that Ricopili can’t find that module, go to the rp_bin directory where Ricopili is installed, and look up the name of the module. The command to invoke the module should match the module’s name in the rp_bin directory. Examples: “preimp_dir”, “preimp_dir_12”.

The following output will be shown:

6. When prompted, edit the text file ending in *.names using a text editor such as emacs or vim. The number of lines in this file corresponds to the number of datasets in the working directory. Each line consists of four columns: 1) a 5-character identifier for each particular study (recommended identifiers are city or study abbreviation and wave, e.g. lond1, cloz2), 2) the root name of one PLINK fileset: *.fam, *.bim, *.bed (e.g. when entered "mydata" the pipeline will search for "mydata.bed"), 3) Numeric value indicating how many consecutive cycles of quality control (QC) have been run using the pipeline, which will also be appended to the resulting files (e.g. if this is the first QC you are performing, you should simply enter "1"), 4) an option to exclude a dataset present in your working directory from the present QC process; use "0" if you want the dataset included.

For our current example, we open the file in .vim, enter i to insert text, and make sure we have something similar to the following on the first line. Then to save, write, and close the .vim file, we do <escape> : w q <enter>. 'w' is for write and 'q' is for quit.

Notes

  • Remove any rows from this file, where you want to exclude the named dataset from QC

  • Below is an example with multiple datafiles, edited in emacs

Example .names file as generated by Ricopili displayed in emacs.

Edited .names file with 4-letter cohort identifier added to each row.

To save the file, enter control-x control-s. Then enter control-x control-c to exit emacs.

Alternatively, you can use vi or another text editing program.

7. Rerun the command to start QC

preimp_dir --dis scz --pop eur --out aber --maf 0.01

Here is the output generated. You will see the message that the job has been successfully submitted to the queue:

Note, you can add --serial on to the command to see the job running interactively (i.e., preimp_dir --pop eur --dis scz --maf 0.01 --out aber –serial). This is good for trouble-shooting.

8. To check the progress of the job, you can look to see if the jobs are in the queue (still going) or not (finished) (line 1 below), and, second, check the output logs and look for the “finished” notification in the status (third column) of the Ricopili log file for the preimp_dir script (line 2 below). Alternatively, await an email, or check to see if “success_file” has been outputted (line 3 below)

qstat -u <user>

tail ~/ricopililogfiles/reimp_dir_info

ls ~/RicopiliTraining/preimputation/aber/qc/imputation/success_file

Where <user> is your username.

You should see N+1 jobs have been submitted to the queue (N = number of datasets to be QCed).

Example qstat output with one input dataset to be QC'd:

The log file loloc/preimp_dir_info contains a record of all commands submitted, where loloc is the directory specified by the user in the installation step where output log files are stored. The preimp_dir_info log file is formatted as follows:

· The first column is the working directory the command was submitted from.

· The second column is the exact command entered to start pre-imputation/QC.

· The third column is a flag to let you know what step the pipeline is at. "finished" means the pipeline completed successfully

· The last column is the date and time the message was printed.

Here is what it looks like:

You can check the location of the log file by opening the ricopili.conf file and looking up the “loloc” row:

You will also get an email when it is finished, based on the email you listed in the ricopili.conf file, if your cluster allows emails to be sent to this address.

Example email:

9. Look at the output files in the qc/ directory

PLINK files with -qc extension will be in your directory for each file in disease.names. This extension will be according to the QC cycle you selected in your disease.names file (see step 6), e.g. "-qc1", "-qc2" etc.

The naming of the files is disease_batch_popname_initials-qc.[bed,bim,fam], where disease is the 3 letter phenotype abbreviation specified by --dis, batch is the cohort identifier in disease.names, popname is the population name specified by --popname, and initials are the user's 2 letter initials specified in $HOME/ricopili.conf.

The file qc/disease_batch_popname_initials-qc.pdf.gz contains a summary of the qc that occurred for each batch including the parameters used.

For a more detailed description of the output files, see here.

Output

Directory Structure

where <aber> is the name that was given to the cohort and <hw> is replaced with your initials you entered during installation.

The top level directory contains the original data files along with two subdirectories: errandout/ and qc/. The errandout directory contains LISA output for submitting jobs related to updating ID names and guessing which platform the data was generated on. The qc directory contains the final QC'd files as well as symbolic links to the original files. In addition, there are three subdirectories: errandout/, imputation/, and tmp_report_scz_aber_eur_<hw>_0.02_0.02_0.02_0.01/. The errandout/ subdirectory in QC contains LISA log files for all of the QC processing. The imputation/ subdirectory contains symbolic links to the QC'd output and a README stating how to run imputation. The tmp_report_mix_gpc1_eur_<hw>_0.02_0.02_0.02_0.01/ directory contains all of the intermediate PLINK and text files generated during QC.

Output Files

Legend:

disease: the disease abbreviation specified by the --dis flag

batch: the 4 letter batch name abbreviation specified by disease.names

popname: the population type specified by --pop (default = european)

initials: your initials specified in the installation step

Interpretation

disease_batch_popname_initials-qc.pdf Example: scz_aber_eur_hw.pdf

Page 3: The sample and SNP N in the raw files versus the completed QCd file is given, with a table summarizing the reasons for removal. Because we used a file that had already been QCd, only SNPs that met our optional –maf filter criteria were removed and no other SNPs or samples. Make sure that your datafile has sex chromosomes if you want sex violations and warnings, if your datafile does not have sex chromosomes it will output that 0 samples were flagged/removed without alerting you that there are no sex loci.

Page 12: From page 12, lambda (chi square of the median p-value / chi square of expected p-value of 0.5) plotted against several variables is given. The blue data points with SNP names are the top SNPs, and the blue scale on the right of the plot is the -log10 p value for the SNP. The red line is a line plot of lambda, with the lambda scale in red on the right. The grey bars are the frequency of the variable plotted, for instance, F_MISS which is fraction of sample missingness. The ideal plot following QC will have a lambda around ~1 to 1.10. Usually, the PCA module can be relied upon to further reduce lambda if it is deviating above 1. These plots can help pinpoint thresholds to assist with further QC. They can also reveal top SNPs that are potential artifacts of QC problems.

Abbreviations in disease_batch_popname_initials-qc.pdf

CHR=chromosome

F_A=frequency in cases

F_U=frequency in controls

F_MISS=fraction of sample missingness

F_MISS_A= fraction of sample missingness in cases

F_MISS_U= fraction of sample missingness in controls

F_MISS_DIF=difference between F_MISS in cases and controls

F_MISS_P=p values for case-control differences in F_MISS

Fhet=heterogeneity F statistic, also known as the inbreeding coefficient

HWE=Hardy-Weinberg equilibrium

MAF=minor allele frequency

TOP100=Top 100 genome-wide significant SNPs

TOP1000=Top 1000 genome-wide significant SNPs

n_observ=number of SNPs

Implementation Details

Important Notes

  • If you want to add another set of PLINK files or make other changes, create another working directory for these files and rerun preimp_dir_[version] in the new directory

  • If your datasets contain different ethnicities, split the PLINK files into each ancestry grouping and make separate working directories for each.

List of Options

Since version 7 you can choose QC thresholds yourself:

The values can have scientific format, so use --hwe_th_co 1.0e-300 to essentially switch off HWE testing in controls

Flags for extraordinary large files included in the version above Jan 25th 2019

Option

--sjatime_inc INT

--sjamem_incr INT

Description

add INT hour to walltime of imputation jobs (also on top of minilong)

increase all memore requests by INT Mb in steps of 1000 (1000 is 1Gb)

Disease Name Suggestions

ID Naming

FIDs (first column in PLINK .fam files) are renamed as follows:

pheno_disease_batch_popname_initials_PLATFORM*FID

Legend:

pheno: cas = case (2), con = control (1)

disease: the disease abbreviation specified by the --dis flag

batch: the 4 letter batch name abbreviation specified by disease.names

popname: the population type specified by --pop (default = european)

initials: your initials specified in the installation step

PLATFORM: best guess of platform data was genotyped on based on comparing SNP names to reference sets

Check the jobs with bjobs -w to see long jobnames

Look for _pri*, which is the driver script that keeps track of the single working scripts

Check file $HOME/preimp_dir_info regularly. it shows status updates of the pipeline.

FAQs

Please refer to FAQs preimputation for this.

For debugging tips please follow this document.

FID: original FID name

Helpful Tips