For variants with a minor allele frequency (MAF) greater than 1%, use this module of the pipeline. It takes the imputed dosage data from the Imputation step and performs association tests, correcting for ancestry with the principal components calculated in the PCA module. Finally, a meta-analysis is performed across all input datasets.
For variants with a MAF less than 1%, we are in the process of developing tools for Rare Variant Analysis.
This module performs association analyses for common variants from imputed dosage data for each dataset QC'd in the Preimputation step and then does a final meta-analysis using METAL. Population stratification is accounted for using principal components generated from the PCA step. It is highly recommended to go through this tutorial and test the module on example data.
1. Make sure the current directory is the directory where you ran the imputation step.
cd qc/imputation/
2. Copy the covariate file that the PCA step produced from the best-guess genotypes into the imputation directory.
cp pcaer_sub/prune.bfile.cobg.OUTNAME_PCA.menv.mds_cov ./
Note: To use PCs calculated from genotyped markers instead, take the covariate file from qc/pcaer_sub/ rather than qc/imputation/pcaer_sub/.
3. Run the following command to start the association analysis:
postimp_navi --out OUTNAME --mds prune.bfile.cobg.OUTNAME_PCA.menv.mds_cov --coco 1,2,3,4,9 --addout run1
with the required flags:
--out: Output file identifier for this project
--mds: A covariate file generated by the PCA step
--coco: The principal components to use in the analysis (In this example, PC1-4 and PC9 will be used)
The following flags are optional, but highly recommended:
--addout: An identifier appended to the output name, for example to record which run this is and/or which principal components were used
--nocon: No conditional analyses. Required if you only have one input dataset.
--nohet: No testing for heterogeneity. Required if you only have one input dataset.
Additional options are specified here.
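As a minimal sketch of how the flags above fit together, the following shell function assembles a postimp_navi command line and prints it for review rather than running it. The function name build_postimp_cmd and its argument order are hypothetical and not part of the pipeline; only the flags themselves come from the list above.

```shell
# Hypothetical helper (not part of the pipeline): assemble a postimp_navi
# command line from the flags described above and print it.
# usage: build_postimp_cmd OUTNAME COVAR_FILE PC_LIST ADDOUT [single]
build_postimp_cmd() {
    out=$1; mds=$2; coco=$3; addout=$4; mode=${5:-multi}

    # --coco takes a comma-separated list of PC numbers, e.g. 1,2,3,4,9
    if ! printf '%s\n' "$coco" | grep -Eq '^[0-9]+(,[0-9]+)*$'; then
        echo "invalid PC list: $coco" >&2
        return 1
    fi

    cmd="postimp_navi --out $out --mds $mds --coco $coco --addout $addout"

    # With only one input dataset, --nocon and --nohet are required.
    if [ "$mode" = "single" ]; then
        cmd="$cmd --nocon --nohet"
    fi

    printf '%s\n' "$cmd"
}
```

Calling build_postimp_cmd OUTNAME prune.bfile.cobg.OUTNAME_PCA.menv.mds_cov 1,2,3,4,9 run1 single prints the command from step 3 with --nocon and --nohet appended, which can then be checked and pasted into the terminal.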
4. If the script runs successfully, you'll see the following message:
------------------------------------------------------------
929 jobs successfully submitted
please see tail of $HOME/postimp_navi_info for regular updates
also check bjobs -w for running jobs
you will be informed via email if errors or successes occur
------------------------------------------------------------
5. Monitor the progress of the pipeline in your log file at loloc/postimp_navi_info, where loloc is the directory the user defined for log files in the installation step.
The log file has one line per pipeline step, with the following columns (an example log file follows the list):
1 - current working directory
2 - exact command submitted to start the analysis
3 - step pipeline is starting
4 - date and time of record
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 daner.927 Tue_May_20_17:29:30_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 dameta.929 Tue_May_20_18:38:05_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 dameta.48 Tue_May_20_18:51:56_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 damecat.2 Tue_May_20_18:52:56_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 lth.2 Tue_May_20_18:56:45_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 areator.2 Tue_May_20_18:59:33_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 areaplot.93 Tue_May_20_19:18:30_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 forestplot.93 Tue_May_20_19:20:50_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 forestplot.4 Tue_May_20_19:22:54_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 manhplot.4 Tue_May_20_19:23:56_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 qqplot.2 Tue_May_20_19:24:57_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 lahu.10 Tue_May_20_21:10:02_2014
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 finished Thu_May_22_16:01:44_2014
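Because the command in column 2 itself contains spaces, the step and the timestamp are easiest to pick out as the last two whitespace-separated fields of each line. The sketch below does that with awk; the function name postimp_progress is hypothetical, and the approach assumes the log keeps the layout shown in the example above.

```shell
# Hypothetical helper: print "<step> <timestamp>" for every record in the log.
# Assumes the step and the timestamp are the last two whitespace-separated
# fields of each line, as in the example log above.
# usage: postimp_progress LOGFILE
postimp_progress() {
    awk 'NF >= 2 { print $(NF-1), $NF }' "$1"
}
```

For instance, postimp_progress loloc/postimp_navi_info | tail -n 3 would show the three most recent steps, with loloc replaced by your own log directory.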
6. When the pipeline is finished, you will receive an email and the word "finished" will appear in your log file.
Example email:
##################################################################
##### CONGRATULATIONS:
##### rp_pipeline_postimp finished successfully:
##### ~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3
##### have a look at the distributions subdir for output files
##### have a look at the wiki page for more details
##### https://sites.google.com/a/broadinstitute.org/ricopili/
##################################################################
Example success message in loloc/postimp_navi_info:
~/rp_out/qc/imputation postimp_navi --out BURT_TEST --mds prune.bfile.cobg.testimp.menv.mds_cov --coco 1,2,3,4 --addout pc14.run3 finished Thu_May_22_16:01:44_2014
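If the same log file collects records from several runs, a check for the "finished" record can be restricted to one run by matching its --addout value. The following is a sketch under that assumption; postimp_finished is a hypothetical name, not a pipeline command.

```shell
# Hypothetical helper: exit status 0 if the log holds a "finished" record
# for the run that was started with "--addout ADDOUT", 1 otherwise.
# usage: postimp_finished LOGFILE ADDOUT
postimp_finished() {
    awk -v tag="$2" '
        # The step name is the second-to-last field of each record.
        NF >= 2 && $(NF-1) == "finished" {
            # Search the command part of the record for "--addout <tag>".
            for (i = 1; i < NF - 2; i++)
                if ($i == "--addout" && $(i+1) == tag) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

A call such as postimp_finished loloc/postimp_navi_info pc14.run3 && echo "run complete" could be used in a script that waits for the results before copying them elsewhere.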
7. If you receive an error message via email, look at the debugging tips for help determining where the error occurred.
Example error message:
##################################################################
##### Error:
##### step forestplot has been done repeatedly without any progress
##### pipeline stopped
##### if reason does not appear obvious
##### have a look at the wiki page
##### https://sites.google.com/a/broadinstitute.org/ricopili/
##### or contact the developers
##################################################################
8. Look at the output files to make sure everything seems reasonable.
All output will be in the distribution/OUTNAME_ADDOUT directory where OUTNAME is specified by --out and ADDOUT is specified by --addout. See the Output Files section for more details on what is contained in each file.
Note: If you have multiple datasets, the association results for each individual study are in report_OUTNAME_ADDOUT/daner_disease_batch_popname_initials-qc.hg19.ch.fl.gz, where the following variables were defined in the Pre-imputation/QC step:
disease = 3-letter disease abbreviation
batch = 4-letter cohort/dataset abbreviation
popname = population ancestry
initials = initials as defined in the ricopili.config file generated in the installation step
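When there are many per-study files, it can help to recover the four name components from each file name. The sketch below splits a name of the form daner_disease_batch_popname_initials-qc.hg19.ch.fl.gz on underscores; the function name parse_daner_name is hypothetical, and it assumes that none of the four components contains an underscore.

```shell
# Hypothetical helper: print "disease batch popname initials" for a
# per-study results file name. Assumes no component contains an underscore.
# usage: parse_daner_name FILENAME
parse_daner_name() {
    name=${1##*/}                    # drop any leading directory
    name=${name#daner_}              # drop the daner_ prefix
    name=${name%-qc.hg19.ch.fl.gz}   # drop the fixed suffix

    # Split what remains on underscores into the positional parameters.
    old_ifs=$IFS; IFS=_
    set -- $name
    IFS=$old_ifs

    if [ $# -ne 4 ]; then
        echo "unexpected file name: $name" >&2
        return 1
    fi
    printf '%s %s %s %s\n' "$1" "$2" "$3" "$4"
}
```

With a made-up file name such as daner_scz_abcd_eur_xy-qc.hg19.ch.fl.gz, the function prints the four components separated by spaces, ready to be read into variables in a loop over the report directory.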
All files are in the directory distribution/OUTNAME_ADDOUT where OUTNAME is specified by --out and ADDOUT is specified by --addout
Example of daner_OUTNAME_ADDOUT.gz.p4.clump.areator.sorted.1mhc.xls:
The legend for the columns is as follows:
SNP = name of variant
CHR = chromosome
BP = base pair position (hg19)
P = P-value
OR = odds ratio for allele 1
SE = standard error
A1A2 = allele 1 and allele 2
FRQ_A = frequency of allele 1 in affected cases
FRQ_U = frequency of allele 1 in unaffected controls
INFO = imputation info score
ngt = number of studies in which this variant was genotyped (vs. imputed)
friends(.1).p0.001 = list of all variants with LD-r2 > 0.1 to index SNP, in brackets LD-r2 and distance in kb, sorted by LD-r2
range.left = left margin of region (defined by LD friends)
range.right = right margin of region (defined by LD friends)
span(kb) = right margin - left margin (in kb)
friends(.6).p0.001, range.left.6, range.right.6, span.6(kb) = as before but with LD-r2 of 0.6
gwas_catalog_span.6 = list of entries in NHGRI GWAS catalogue among entries in column friends(.6)
genes.6.50kb(dist2index) = list of genes within the region of friends.6 (±50 kb), in brackets distance to index SNP in kb
Example of areas.OUTNAME_ADDOUT.pdf.gz:
In the areas.*_*.pdf.gz plots (output of the common variant analysis), if more than one independent SNP appears in the same plot, each is given a different color and its LD partners are shaded in the same color. Detailed information about each independent index SNP (a., b., c., ...) is provided in the upper right corner. If a SNP is an LD partner of more than one independent index SNP, it is assigned to the more significant one.
Black dots inside colored dots denote non-HM3 (HapMap 3) SNPs. This was of interest when 1KG (1000 Genomes) was still new, but is largely unnecessary now.
Example of areas.fo.OUTNAME_ADDOUT.pdf.gz:
The forest plot shows the odds ratio and standard error for each study as well as for the combined meta-analysis.
Example of daner_OUTNAME_ADDOUT.gz: (for more details see here)
CHR SNP BP A1 A2 FRQ_A_297 FRQ_U_186 INFO OR SE P ngt
10 rs187110906 60969 A C 0.0874 0.0691 0.3887 1.9147 0.4206 0.1225 0
10 rs12260013 66326 A G 0.9630 0.9761 0.4549 0.2792 0.6590 0.05285 0
10 chr10_66627_D 66627 I5 D 0.4726 0.5216 0.5727 0.7064 0.1776 0.05028 0
Columns are as follows:
CHR = chromosome number
SNP = SNP name
BP = Base pair position
A1 = Allele 1 (Odds Ratios calculated with respect to this allele)
A2 = Allele 2
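FRQ_A_#Cases = Frequency of A1 in cases (the number in the column name is the number of cases, e.g. 297 above)
FRQ_U_#Controls = Frequency of A1 in controls (the number in the column name is the number of controls, e.g. 186 above)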
INFO = quality of imputation for the SNP (usually keep SNPs with an INFO score > 0.6)
OR = Odds ratio
SE = Standard Error
P = P-value
ngt = Number of Datasets where the SNP was genotyped directly
Note: If there is more than one input dataset, there will also be a column containing a string of +, -, and ? characters that gives the direction of effect (the sign of the log odds ratio) for each study.
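The columns above can be used to filter the results, for example by the INFO threshold mentioned in the list. The following is a minimal sketch, not part of the pipeline: the function name filter_daner is hypothetical, columns are located by header name because the FRQ_A_* and FRQ_U_* names vary, and the extra Z column is the Wald statistic ln(OR)/SE, added here only as a sanity check against the reported P-value.

```shell
# Hypothetical helper: keep variants whose INFO score exceeds MIN_INFO and
# append Z = ln(OR) / SE. Reads a decompressed daner file on standard input.
# usage: gzip -dc daner_OUTNAME_ADDOUT.gz | filter_daner 0.6
filter_daner() {
    awk -v min_info="$1" '
        NR == 1 {
            # Map header names to column numbers, because the FRQ_A_* and
            # FRQ_U_* names change with the number of cases and controls.
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("INFO" in col) || !("OR" in col) || !("SE" in col)) {
                print "missing INFO, OR or SE column" > "/dev/stderr"
                exit 1
            }
            print "SNP", "OR", "SE", "P", "Z"
            next
        }
        # Skip rows with a non-positive SE or OR to avoid invalid arithmetic.
        ($col["INFO"] + 0) > (min_info + 0) && ($col["SE"] + 0) > 0 && ($col["OR"] + 0) > 0 {
            z = log($col["OR"]) / $col["SE"]
            printf "%s %s %s %s %.4f\n", $col["SNP"], $col["OR"], $col["SE"], $col["P"], z
        }
    '
}
```

Applied to the three example rows above with a threshold of 0.6, only the header line is printed, since all three INFO scores are below 0.6; lowering the threshold to 0.5 keeps chr10_66627_D.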
For frequently asked questions about this module, please refer to faqs-postimputation.
For debugging tips please follow this document.