Imputation

Overview

This module performs imputation on binary PLINK datasets generated by the Preimputation-QC step. The output is a set of dosage probabilities for all markers in a user-specified reference panel. There are a number of reference panels to choose from, including MHC classical alleles and amino acids, HapMap, and 1000 Genomes. It is highly recommended to go through this tutorial before imputing your data.

Technical Details

From Ricopili version rp_bin.2017_Nov_30.003.tar.gz (link to versions):

1. Prephasing is done using Eagle v2.3.5.

2. Imputation is done using Minimac3.

In prior versions of Ricopili:

1. Prephasing is done using SHAPEIT. All samples are pre-phased together.

2. Imputation is done using IMPUTE2. Imputation is done in smaller chunks of 300 samples at a time.

Usage Instructions

1. Change into the imputation directory created by the Preimputation QC Step:

cd ~/RicopiliTraining/preimputation/aber/qc/imputation/

OR make a new directory and, inside it, create symbolic links to all binary PLINK files from the Pre-Imputation/QC step that you want to impute. An example is shown below:

mkdir imputation/

cd imputation/

ln -s ~/RicopiliTraining/preimputation/aber/qc/imputation/scz_aber_eur_hw.{fam,bim,bed} .

ln -s ~/RicopiliTraining/preimputation/ajsz/qc/imputation/scz_ajsz_eur_hw.{fam,bim,bed} .

ln -s ~/RicopiliTraining/preimputation/asrb/qc/imputation/scz_asrb_eur_hw.{fam,bim,bed} .
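Before starting, it can help to confirm that every dataset in the directory has all three binary PLINK files and that no symbolic link is broken. The following is a minimal sketch; check_plink_trios is a hypothetical helper written for this page, not part of Ricopili.

```shell
# Sketch: report any PLINK prefix in a directory that lacks one of
# .fam/.bed/.bim or whose symbolic link points nowhere.
# check_plink_trios is a hypothetical helper, not a Ricopili command.
check_plink_trios() {
    dir="${1:-.}"
    missing=0
    for fam in "$dir"/*.fam; do
        # If the glob matched nothing, the literal pattern is neither a file nor a link.
        [ -e "$fam" ] || [ -L "$fam" ] || continue
        prefix="${fam%.fam}"
        for ext in fam bed bim; do
            # -e follows symbolic links, so a dangling link is reported too.
            [ -e "$prefix.$ext" ] || { echo "missing or broken link: $prefix.$ext"; missing=1; }
        done
    done
    return "$missing"
}
```

Running check_plink_trios . inside the imputation directory prints nothing and returns 0 when every dataset is complete. An empty directory also returns 0, so check separately that the links were created.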

2. Choose which reference panel you want to use for imputation from the options listed here:

3. Check the sample size (N) of your QC'd fam file, i.e. its number of lines:

wc -l ~/RicopiliTraining/preimputation/aber/qc/imputation/scz_aber_eur_hw-qc.fam

4. Enter one of the following commands, depending on your sample size, to start the Imputation module. For N ≤ 2000:

impute_dirsub --refdir /home/gwas/pgc-samples/hapmap_ref/impute2_ref/1000GP_Phase3_sr_0517d --out aber --minimac3

For N > 2000 but ≤ 3500 add the option --minilong, i.e.:

impute_dirsub --refdir /home/gwas/pgc-samples/hapmap_ref/impute2_ref/1000GP_Phase3_sr_0517d --out aber --minimac3 --minilong

For N > 3500 add the options --minilong --multi <INT>:

impute_dirsub --refdir /home/gwas/pgc-samples/hapmap_ref/impute2_ref/1000GP_Phase3_sr_0517d --out aber --minimac3 --minilong --multi 8

Where <INT> is an integer value, and aber is the output name you have given to the imputation.

--minilong instructs the pipeline to use long jobs (i.e. 24 hours) for the Minimac imputation, which is preferred over multithreading. --multi instructs the job to use INT cores for multithreaded prephasing (i.e., to run concurrent processes for job efficiency).

Note, for prior versions of Ricopili, the --minilong and --multi options are not available.
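The choice of options by sample size can be sketched as a small shell helper. This is only an illustration: imputation_size_opts and fam_count are hypothetical names, the default of 8 cores simply mirrors the example command, and the plain command is taken to apply to N ≤ 2000, as the thresholds above imply.

```shell
# Sketch: derive the extra impute_dirsub options from the sample count.
# Both functions are hypothetical helpers, not Ricopili commands.

# Count samples: a PLINK .fam file has one line per individual.
fam_count() {
    wc -l < "$1" | tr -d ' '
}

# imputation_size_opts N [CORES]
imputation_size_opts() {
    n="$1"
    cores="${2:-8}"              # value passed to --multi (assumed default)
    if [ "$n" -le 2000 ]; then
        echo ""                  # no extra options needed
    elif [ "$n" -le 3500 ]; then
        echo "--minilong"
    else
        echo "--minilong --multi $cores"
    fi
}
```

For example, opts=$(imputation_size_opts "$(fam_count scz_aber_eur_hw-qc.fam)") stores the options to append to the impute_dirsub command line after --minimac3.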

Additional options are located here.

5. If everything ran successfully, jobs will be submitted to the cluster and you'll see output like the following:

If not, see the Debugging Tips section below.

6. You can monitor the imputation progress in two ways. First, check whether the jobs are still in the queue (running) or not (finished) (line 1 below). Second, check the output logs and look for the “finished” notification in the status (third column) of the Ricopili log file for the impute_dir_info script (line 2 below). Alternatively, await an email, or check whether “success_file” has been written (line 3 below). To see how to locate the output log file, check here.

qstat -u <user>

tail ~/ricopililogfiles/impute_dir_info

ls ~/RicopiliTraining/qc/imputation/success_file

Where <user> is your username.
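If you prefer not to check by hand, the test for “success_file” can be wrapped in a simple polling loop. The sketch below assumes only that success_file appears in the imputation directory on completion, as described above; wait_for_success, the 10-minute interval and the limit of 144 checks are illustrative choices, not Ricopili features.

```shell
# Sketch: poll an imputation directory until success_file appears.
# wait_for_success DIR [INTERVAL_SECONDS] [MAX_CHECKS]
# Hypothetical helper; the defaults (600 s, 144 checks) are arbitrary.
wait_for_success() {
    dir="$1"
    interval="${2:-600}"
    max="${3:-144}"
    i=0
    while [ "$i" -lt "$max" ]; do
        if [ -e "$dir/success_file" ]; then
            echo "imputation finished in $dir"
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "no success_file in $dir after $max checks"
    return 1
}
```

The function only reports that the file exists; if it gives up, the log file and the queue are still the places to look for what happened.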

Notes

The log file loloc/impute_dir_info contains a record of all commands submitted, where loloc is the directory specified by the user in the installation step where output log files are stored. You can check the location by opening the ricopili.conf file and looking up the “loloc” row.
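Looking up the “loloc” row can also be done from the command line. The sketch below assumes that each row of ricopili.conf is a key followed by whitespace and a value; that layout, and the helper name conf_value, are assumptions, so check them against your own file.

```shell
# Sketch: print the value stored for a key in a "key value" style config file.
# conf_value FILE KEY
# Assumes whitespace-separated rows with the key in the first field.
conf_value() {
    awk -v key="$2" '$1 == key { print $2; exit }' "$1"
}
```

Called as conf_value /path/to/ricopili.conf loloc, it prints the log directory, and the impute_dir_info file is then found inside that directory.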

The impute_dir_info log file is formatted as follows:

  • The first column is the working directory the command was submitted from.

  • The second column is the exact command entered to start imputation.

  • The third column is a flag that tells you which step the pipeline is at. "finished" means the pipeline completed successfully.

  • The last column is the date and time the message was printed.
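Given this column layout, the latest status for one working directory can be pulled out with awk. This is a sketch under the assumption that the columns are tab-separated, which the description above does not state; adjust the -F option if your log uses a different delimiter. last_status is a hypothetical helper name.

```shell
# Sketch: print the status (third column) of the most recent log entry
# whose first column equals the given working directory.
# last_status LOGFILE WORKDIR
# Assumption: columns are separated by tabs.
last_status() {
    awk -F '\t' -v dir="$2" '
        $1 == dir { status = $3 }               # later lines overwrite earlier ones
        END { if (status != "") print status }
    ' "$1"
}
```

The working directory must be written exactly as it appears in the first column of the log, including any trailing slash.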

Example Log File:

You will get an email when it is finished, based on the email you listed in the ricopili.conf file, if your cluster allows emails to be sent to this address.

Example email:

If you receive an email stating there is an error in the pipeline, then check the debugging tips below.

A description of the output files generated is listed below.

7. If you want to generate principal components for all batches together, do the following:

a. Create a new working directory for the PCA

mkdir ~/RicopiliTraining/PCA_all/

b. Copy the files with the file name format shown below from the /imputation/pcaer_sub/ directories, then run the PCA step on them. These files are the LD-pruned data generated from the best-guess genotype calls from imputation.

cd ~/RicopiliTraining/PCA_all/

cp ~/RicopiliTraining/preimputation/aber/qc/imputation/pcaer_sub/prune.bfile.cobg.aber.bim .

cp ~/RicopiliTraining/preimputation/ajsz/qc/imputation/pcaer_sub/prune.bfile.cobg.ajsz.bim .

cp ~/RicopiliTraining/preimputation/asrb/qc/imputation/pcaer_sub/prune.bfile.cobg.asrb.bim .

pcaer --prefercase --out prune.bfile.cobg.PCA_all prune.bfile.cobg.aber.bim prune.bfile.cobg.ajsz.bim prune.bfile.cobg.asrb.bim

The complete instructions for running PCA are provided here.

List of Options

Output Files (please also refer to teaching material, focus on imputation slides 35 and 38)

Legend:

disease: the disease abbreviation specified by the --dis flag from the preimputation/QC step

batch: the 4 letter batch name abbreviation specified by disease.names from the preimputation/QC step

popname: the population type specified by --pop (default = european) from the preimputation/QC step

initials: your initials specified in the installation step

N: chromosome number of data in the file

START: 3 digit start position in MB (000 is the beginning of the chromosome)

END: 3 digit end position in MB (this is always START + 003 because imputation is done in 3 MB chunks)

OUTNAME: output file name specified by the --out flag
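The START and END fields follow a simple rule: both are zero-padded to 3 digits and END is START + 3. A short sketch of that arithmetic is given below; the underscore between the two fields and the helper names are illustrative assumptions, not the actual Ricopili file naming.

```shell
# Sketch: format the START/END pair for a 3 MB imputation chunk.
# chunk_label START_MB   (plain decimal integer without leading zeros)
# The "_" separator is an assumption made for illustration only.
chunk_label() {
    start="$1"
    printf '%03d_%03d\n' "$start" "$((start + 3))"
}

# chunk_labels LENGTH_MB: list every chunk needed to cover LENGTH_MB megabases.
chunk_labels() {
    len="$1"
    pos=0
    while [ "$pos" -lt "$len" ]; do
        chunk_label "$pos"
        pos=$((pos + 3))
    done
}
```

For instance, a region of 7 MB is covered by three chunks, starting at 000, 003 and 006.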

Combining Ricopili Directories

  1. Create a new, empty directory and change into it (this will be the new root directory for the combined collection).

  2. my.joinimp2 DIR1 DIR2 DIR3 ......

Some remarks:

  • DIR[...] are the directories of the former imputation runs (it is best to grep for "finished" in your impute_logfile to confirm that they completed). Please use full paths (starting with "/", not relative paths such as "../../").

  • The script will do some analysis and linking. If successful, you get a success message and detailed instructions on how to proceed: you have to start the impute script again with a new name (it will only perform the last steps, not the imputation).

  • None of the original files are touched or moved, so if the run is not successful, just delete the whole directory and restart.

  • The script itself should run for at most 2 minutes.
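Because the script expects full paths, it can be worth checking the directory arguments before running it. The following is a minimal sketch; check_abs_dirs is a hypothetical helper and does not call my.joinimp2 itself.

```shell
# Sketch: verify that every argument is an absolute path to an existing directory.
# check_abs_dirs DIR1 DIR2 ...
# Returns 0 if all arguments pass, 1 otherwise.
check_abs_dirs() {
    bad=0
    for d in "$@"; do
        case "$d" in
            /*)
                [ -d "$d" ] || { echo "not a directory: $d"; bad=1; }
                ;;
            *)
                echo "not an absolute path: $d"
                bad=1
                ;;
        esac
    done
    return "$bad"
}
```

If the check returns 0, the same list of directories can be passed on to the combining script.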

FAQs

Please refer to faqs-imputation for this module.

For debugging tips please follow this document.