Imputation
Overview
This module performs imputation on binary PLINK datasets generated by the Preimputation-QC step. The output is a set of dosage probabilities for all markers in a user-specified reference panel. There are a number of reference panels to choose from, including MHC classical alleles and amino acids, HapMap, and 1000 Genomes. It is highly recommended to go through this tutorial before imputing your data.
Technical Details
From Ricopili version rp_bin.2017_Nov_30.003.tar.gz (link to versions):
1. Prephasing is done using Eagle v2.3.5.
2. Imputation is done using Minimac3.
In prior versions of Ricopili:
1. Prephasing is done using SHAPEIT. All samples are pre-phased together.
2. Imputation is done using IMPUTE2. Imputation is done in smaller chunks of 300 samples at a time.
Usage Instructions
1. Change into the imputation directory created by the Preimputation QC Step:
cd ~/RicopiliTraining/preimputation/aber/qc/imputation/
OR make a new directory and make symbolic links to all binary PLINK files from the Pre-Imputation/QC step that you want to impute into the new directory. An example is shown below:
mkdir imputation/
cd imputation/
ln -s ~/RicopiliTraining/preimputation/aber/qc/imputation/scz_aber_eur_hw.{fam,bim,bed} .
ln -s ~/RicopiliTraining/preimputation/ajsz/qc/imputation/scz_ajsz_eur_hw.{fam,bim,bed} .
ln -s ~/RicopiliTraining/preimputation/asrb/qc/imputation/scz_asrb_eur_hw.{fam,bim,bed} .
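Before moving on, it can help to confirm that every dataset you linked has all three binary PLINK files in place. The following is a minimal sketch and not part of Ricopili; the function name has_plink_fileset is hypothetical.

```shell
# Hypothetical helper (not part of Ricopili): succeed only if PREFIX.bed,
# PREFIX.bim and PREFIX.fam all exist. Symbolic links are followed, so a
# dangling link counts as missing.
has_plink_fileset() {
    prefix="$1"
    for ext in bed bim fam; do
        if [ ! -e "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext" >&2
            return 1
        fi
    done
    return 0
}

# Example call, run from inside the imputation directory:
#   has_plink_fileset scz_aber_eur_hw && echo "aber: all three files present"
```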
2. Choose which reference panel you want to use for imputation from the options listed here:
3. Check the sample size (N) of your QC'd fam file by counting its lines (one line per sample):
wc -l ~/RicopiliTraining/preimputation/aber/qc/imputation/scz_aber_eur_hw-qc.fam
4. Enter one of the following commands, depending on your sample size, to start the Imputation module. For N ≤ 2000:
impute_dirsub --refdir /home/gwas/pgc-samples/hapmap_ref/impute2_ref/1000GP_Phase3_sr_0517d --out aber --minimac3
For N > 2000 but ≤ 3500 add the option --minilong, i.e.:
impute_dirsub --refdir /home/gwas/pgc-samples/hapmap_ref/impute2_ref/1000GP_Phase3_sr_0517d --out aber --minimac3 --minilong
For N > 3500 add the options --minilong --multi <INT>:
impute_dirsub --refdir /home/gwas/pgc-samples/hapmap_ref/impute2_ref/1000GP_Phase3_sr_0517d --out aber --minimac3 --minilong --multi 8
Where <INT> is an integer value, and aber is the output name you have given to the imputation.
--minilong instructs the pipeline to use long jobs (i.e., 24 hours) for Minimac imputation; this is preferable to multithreading. --multi instructs the job to use multithreading on <INT> cores for prephasing (i.e., to run concurrent processes for job efficiency).
Note that the --minilong and --multi options are not available in prior versions of Ricopili.
Additional options are located here.
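The sample-size thresholds in step 4 can be sketched as a small shell function. This is only an illustration and not part of Ricopili: the function name choose_impute_flags is hypothetical, and the default of 8 cores is simply the value used in the example command for N > 3500.

```shell
# Hypothetical helper (not part of Ricopili): print the extra impute_dirsub
# options suggested for a sample size N.
#   $1 = N (number of lines in the QC'd fam file)
#   $2 = number of cores for --multi (optional; defaults to 8, an assumption
#        taken from the example command)
choose_impute_flags() {
    n="$1"
    cores="${2:-8}"
    if [ "$n" -gt 3500 ]; then
        printf '%s\n' "--minilong --multi $cores"
    elif [ "$n" -gt 2000 ]; then
        printf '%s\n' "--minilong"
    else
        # N <= 2000: no extra options
        printf '\n'
    fi
}

# Example: count the samples, then print (not run) the resulting command.
#   n=$(awk 'END { print NR }' scz_aber_eur_hw-qc.fam)
#   echo impute_dirsub --refdir <refdir> --out aber --minimac3 $(choose_impute_flags "$n")
```

Printing the command with echo first lets you inspect it before submitting any jobs.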
5. If everything ran successfully, jobs will be submitted to the cluster and you'll see output like the following:
If not, see the Debugging Tips section below.
6. You can monitor the imputation progress in two ways. First, check whether the jobs are still in the queue (still running) or not (finished) (line 1 below). Second, check the output logs and look for the “finished” notification in the status (third column) of the Ricopili log file for the impute_dir_info script (line 2 below). Alternatively, await an email, or check whether “success_file” has been created (line 3 below). To see how to locate the output log file, check here.
qstat -u <user>
tail ~/ricopililogfiles/impute_dir_info
ls ~/RicopiliTraining/qc/imputation/success_file
Where <user> is your username.
Notes
The log file loloc/impute_dir_info contains a record of all commands submitted, where loloc is the directory specified by the user in the installation step where output log files are stored. You can check the location by opening the ricopili.conf file and looking up the “loloc” row.
The impute_dir_info log file is formatted as follows:
The first column is the working directory the command was submitted from.
The second column is the exact command entered to start imputation.
The third column is a flag to let you know what step the pipeline is at. "finished" means the pipeline completed successfully.
The last column is the date and time the message was printed.
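Given this column layout, the latest status for one working directory can be pulled out of the log with awk. The sketch below is not part of Ricopili: the function name last_status is hypothetical, and it assumes the columns are separated by tabs, which should be verified against your own impute_dir_info file before relying on it.

```shell
# Hypothetical helper (not part of Ricopili).
# ASSUMPTION: the columns of impute_dir_info are tab-separated.
#   $1 = path to the impute_dir_info log file
#   $2 = working directory exactly as it appears in the first column
# Prints the third column (status) of the last matching line, or an empty
# line if the directory does not appear in the log.
last_status() {
    logfile="$1"
    workdir="$2"
    awk -F'\t' -v dir="$workdir" '
        $1 == dir { status = $3 }
        END { print status }
    ' "$logfile"
}

# Example call:
#   [ "$(last_status ~/ricopililogfiles/impute_dir_info "$PWD")" = "finished" ] && echo "done"
```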
Example Log File:
You will get an email when it is finished, sent to the address you listed in the ricopili.conf file, provided your cluster allows emails to be sent to this address.
Example email:
If you receive an email stating there is an error in the pipeline, then check the debugging tips below.
A description of the output files generated is listed below.
7. If you want to generate principal components for all batches together, do the following:
a. Create a new working directory for the PCA
mkdir ~/RicopiliTraining/PCA_all/
b. Copy the files with the file name format shown below from each batch's /imputation/pcaer_sub/ directory, then run the PCA step on them. These files are the LD-pruned data generated from best-guess genotype calls from imputation.
cd ~/RicopiliTraining/PCA_all/
cp ~/RicopiliTraining/preimputation/aber/qc/imputation/pcaer_sub/prune.bfile.cobg.aber.bim .
cp ~/RicopiliTraining/preimputation/ajsz/qc/imputation/pcaer_sub/prune.bfile.cobg.ajsz.bim .
cp ~/RicopiliTraining/preimputation/asrb/qc/imputation/pcaer_sub/prune.bfile.cobg.asrb.bim .
pcaer --prefercase --out prune.bfile.cobg.PCA_all prune.bfile.cobg.aber.bim prune.bfile.cobg.ajsz.bim prune.bfile.cobg.asrb.bim
The complete instructions for running PCA are provided here.
List of Options
Output Files (please also refer to teaching material, focus on imputation slides 35 and 38)
Legend:
disease: the disease abbreviation specified by the --dis flag from the preimputation/QC step
batch: the 4 letter batch name abbreviation specified by disease.names from the preimputation/QC step
popname: the population type specified by --pop (default = european) from the preimputation/QC step
initials: your initials specified in the installation step
N: chromosome number of data in the file
START: 3 digit start position in MB (000 is the beginning of the chromosome)
END: 3 digit end position in MB (this is always START + 003 because imputation is done in 3 MB chunks)
OUTNAME: output file name specified by the --out flag
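As a worked example of the START and END fields, the END value for a chunk can be computed from its START value by adding 3 and zero-padding back to three digits. This is a minimal sketch; the function name chunk_end is hypothetical and not part of Ricopili.

```shell
# Hypothetical helper (not part of Ricopili): print the 3-digit END position
# (in MB) for a given 3-digit START position, using END = START + 003.
chunk_end() {
    # awk reads "008" as the decimal number 8, so leading zeros are safe here
    awk -v s="$1" 'BEGIN { printf "%03d\n", s + 3 }'
}

# Example calls:
#   chunk_end 000
#   chunk_end 099
```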
Combining Ricopili Directories
Create a new empty directory and change into it (this will be your new root directory for the combined collection), then run:
my.joinimp2 DIR1 DIR2 DIR3 ......
Some remarks:
DIR[...] are the directories of the former imputation runs (it is best to grep for "finished" in your impute log file). Please use full paths (starting with "/", not relative paths such as "../../").
The script will do some analysis and linking. If successful, you get a success message and detailed instructions on how to proceed: you have to start the impute script again with a new name (it will only perform the last steps, not the imputation).
None of the original files are touched or moved, so if the run is not successful, just delete the whole directory and restart.
The script itself should run for at most 2 minutes.
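Because the directories must be given as full paths, a quick check before calling my.joinimp2 can catch relative paths early. The sketch below is only an illustration; the function name all_absolute_paths is hypothetical and not part of Ricopili.

```shell
# Hypothetical helper (not part of Ricopili): succeed only if every argument
# is a full path, i.e. starts with "/".
all_absolute_paths() {
    for dir in "$@"; do
        case "$dir" in
            /*) ;;  # full path, keep checking
            *)
                echo "not a full path: $dir" >&2
                return 1
                ;;
        esac
    done
    return 0
}

# Example: only call the join script when all directories are full paths.
#   all_absolute_paths DIR1 DIR2 DIR3 && my.joinimp2 DIR1 DIR2 DIR3
```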
FAQs
Please refer to faqs-imputation for this module.
For debugging tips please follow this document.