PCA

Overview

This module takes PLINK binary output file from the Preimputation/QC step and calculates the principal components, determines overlapping samples, determines which covariates are associated with the genotype data, and generates PCA plots a to check the ancestry of the cohorts and to exclude ancestry outliers. Please go (highly recommended) through this helpful tutorial for understanding the application of this module in the real world analysis.

Technical Details

1. To assess relatedness between samples and population stratification, we used the chip-wide SNP data passing stringent quality control.

- SNPs are found in all datasets

- MAF > 5%

- HWE > 1.0e-03

- MISSING RATE < 2%

- no AT/GC SNPs (strand ambiguous SNPs)

- no MHC (6:25-35Mb)

- no Chr.8 inversion (8:7-13Mb)

2. We then pruned SNPs to ensure that there is little linkage disequilibrium between SNPs (R2 < 0.2).

- LD - R2 < .2, 200 SNPs window: plink --indep-pairwise 200 100 0.2 http://pngu.mgh.harvard.edu/~purcell/plink/summary.shtml#prune

- repeat LD pruning with resulting LD pruned dataset

- if still over 100K SNPs (rare) prune randomly

3. We used the resulting SNPs to assess recent common ancestry and population stratification with Eigenstrat.

Usage Instructions

1. Make a new working directory for the PCA module, with two new subdirectories. First, we are going to first do a PCA with reference data, and then follow this up with a PCA without reference data. Change into the new directory 'pcawithref'.

mkdir ~/RicopiliTraining/pca/ ~/RicopiliTraining/pca/pcawithref ~/RicopiliTraining/pca/pcawithoutref

cd ~/RicopiliTraining/pca/pcawithref

2. Make symbolic links of your QC'd PLINK files into the new subdirectory

ln -s ~/RicopiliTraining/preimputation/aber/qc/scz_aber_eur_hw-qc.{fam,bim,bed}

3. Run the PCA Module with the following command:

pcaer --out output_name bfile-qc.bim

To run PCA on multiple files, use the following:

pcaer --out output_name bfile1-qc.bim bfile2-qc.bim bfileN-qc.bim

For a list of other options, see here.

4. There will be many iterations of multiple jobs submitted, please check point (5) regularly.

5. Check your log file to determine if the jobs finished successfully.

  • If everything has completed successfully, you should see a message similar to the one below in loloc/pcaer_info, where loloc is the directory specified by the user for the log files in the installation step:

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim finished Fri_May__9_11:13:38_2014


The first column is the working directory the command was submitted from.

The second column is the exact command entered to start PCA.

The third column is a flag to let you know what step the pipeline is at. "everything-is-fine" means the pipeline completed successfully

The fourth column is the date and time the message was printed.

Example Log File:

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim merge_300 Wed_May__7_13:41:32_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim pruning_prep Wed_May__7_19:26:25_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim pruning_1 Wed_May__7_19:27:35_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim pruning_bed Wed_May__7_19:28:06_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim pruning_2 Wed_May__7_19:28:40_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim merge_pruned Wed_May__7_19:29:11_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim genome.1 Wed_May__7_19:31:34_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim report-overlap Thu_May__8_11:37:24_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim epca.1 Thu_May__8_11:41:41_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim asso.20 Thu_May__8_14:40:45_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim gwapl.20 Thu_May__8_14:42:28_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim gwapl.13 Thu_May__8_14:44:41_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim qqpl.20 Thu_May__8_14:48:57_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim lahu.40 Thu_May__8_14:58:00_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim started-pca-plot Fri_May__9_11:10:40_2014

~/rp_out/qc pcaer_13 --out burtlab add_burt_eur_jg-qc.bim finished Fri_May__9_11:13:38_2014

  • If the pipeline did not finish successfully, look at the debugging tips here to determine where the error occurred.

6. Look at the output files in qc/

output_name is the text entered at the command line with the flag --out.

A description of all output files is available here.

A list of overlapping samples is also available in the tarball output_name.menv.mds.tar.gz can be extracted:

tar -xvzf output_name.menv.mds.tar.gz

Legend:

output_name: the output name text specified by the --out flag

The most important files you want to look initially are the following:

  • output_name.menv.mds.2ds.pdf.gz ## Plots of most significant PCs against other PCs (2-D)

  • output_name.menv.mds.sum.pdf ## Shows which principal components are significant

To determine whether the output looks correct:

a. The value of lambda should decrease as the principal components increase in output_name.menv.mds.sum.pdf. See an example below from SCZ meta-analysis.

If not, then there could be a problem with LD pruning.

b. There should not be manhattan-like peaks in the output_name.menv.mds.asso-nup.pdf.gz file,

If peaks are present, then there could be a problem with LD pruning. see below an example of how they should look like (first four PCAs).

7. If you have more than one population present after looking at the PCA plots, divide up your samples into African, Asian, and European subsets and repeat Preimputation/QC and PCA from the original datasets.

For example, here's a dataset with multiple populations where the densest cluster represents samples of European ancestry.

We can use awk to isolate samples in the European cluster based on the requirements PC1 is greater than 0.005 and less than 0.03 and PC2 is greater than 0.01.

awk '{ if ($4 >= 0.005 && $4 <= 0.03 && $5 >= 0.01) print $1,$2 }' output_name.menv.mds_cov > european_samples.txt

Next, we can use PLINK to subset the European samples from the original PLINK files from the Preimputation/QC step.

plink --bfile my.original.bfile --keep european_samples.txt --make-bed --out my.european.bfile

The same procedure can be used to make subsets of samples of Asian or African ancestry. To rerun Preimputation/QC specifying an Asian population, use --pop asn. For an African population, use --pop aam.

8. If one population is present, determine which samples to keep for imputation.

For example, here's a plot of PC1 versus PC2 for a European population.

We want a homogenous group of samples for analysis, so we may want to remove outliers that could confound our analyses such as samples with PC1 greater than 0.05 or less than -0.05 or PC2 greater than 0.3.

List of Options

FAQs

Please refer to this page for PCA FAQs.

For debugging tips please follow this document.