Merge Calling Algorithms

Overview

This tool consists of one perl script that takes genotype calls from multiple genotype calling algorithms in binary PLINK format and merges them into one call set. It was specifically written for the following algorithms: GenCall (AutoCall), Birdseed, and zCall. All PLINK commands are run using PLINK2.

Technical Details

Samples with missing phenotypes or unknown status (-9 in the phenotype file) are removed from the data. In addition, duplicate samples are renamed with 1…N-1 "d"s appended to the ID name where N is the number of duplicates of that sample. For example, if NA12878 is present four times, then the following sample names will appear in the final dataset: NA12878, NA12878d, NA12878dd, NA12878ddd.

Depending on what calling algorithms were used in the input files, the script does the following:

Birdseed Only:

No QC is done. A copy of the input Birdseed file is made.

GenCall Only:

No QC is done. A copy of the input GenCall file is made.

zCall Only:

The following error message is given and the script exits:

Only detected zCall input... Must have GenCall input too!

Birdseed + zCall:

No QC is done. A copy of the input Birdseed file is made. zCall can only be used if GenCall is also present.

Birdseed + GenCall:

1. Determine which samples overlap between the two files. Samples that are not present in both call sets are removed.

2. Determine samples that fail in each algorithm. First, SNPs that have call rate less than 90% are removed. Next, sample call rates are calculated. Samples that have a call rate less than 95% (or a user-defined threshold specified by --mind) are considered failing. Samples are removed from both GenCall and Birdseed if they failed in both GenCall & Birdseed or in only one of the callsets.

3. QC of GenCall data. Calculate the Hardy Weinberg Equilibrium p-value for each snp and removing snps with p < 1e-6 (or user-defined threshold by --hwe). Calculate the call rate per SNP and remove SNPs with call rate less than 97% (or user-defined threshold by --geno).

4. QC of Birdseed data. Same as for GenCall data.

5. Merge of QC'd GenCall & QC'd Birdseed data using merge-mode 1 in PLINK.

6. QC of Merged data. Calculate the Hardy Weinberg Equilibrium p-value for each snp and removing snps with p < 1e-6 (or user-defined threshold by --hwe). Calculate the call rate per SNP and remove SNPs with call rate less than 97% (or user-defined threshold by --geno). Calculate the post-merge sample call rate and remove any samples with call rate less than 95% (or user-defined threshold by --mind).

GenCall + zCall:

1. Determine which samples overlap between the two files. Samples that are not present in both call sets are removed.

2. Determine samples that fail in GenCall data. First, SNPs that have call rate less than 90% are removed. Next, sample call rates are calculated. Remove samples that have a call rate less than 95% (or a user-defined threshold specified by --mind). The failing samples are also removed from the zCall dataset.

3. QC of GenCall data. Calculate the Hardy Weinberg Equilibrium p-value for each snp and removing snps with p < 1e-6 (or user-defined threshold by --hwe). Calculate the call rate per SNP and remove SNPs with call rate less than 97% (or user-defined threshold by --geno).

4. Determine which SNPs in the QC'd GenCall dataset are rare (default MAF < 1% or user-defined by --maf) and extract them from the zCall dataset.

5. QC of rare subset of zCall data. Calculate the Hardy Weinberg Equilibrium p-value for each snp and removing snps with p < 1e-6 (or user-defined threshold by --hwe). Calculate the call rate per SNP and remove SNPs with call rate less than 97% (or user-defined threshold by --geno). Calculate the minor allele frequency of each SNP. If the minor allele frequency after zCall is greater than 5% (or user-defined threshold by --maxmaf) after zCall, then the GenCalls (and not the zCalls) are used for that SNP.

6. Merge QC'd GenCall data with the QC'd set of zCall data.

Birdseed + GenCall + zCall:

1. Determine which samples overlap between the two files. Samples that are not present in both call sets are removed.

2. Determine samples that fail in each algorithm. First, SNPs that have call rate less than 90% are removed. Next, sample call rates are calculated. Samples that have a call rate less than 95% (or a user-defined threshold specified by --mind) are considered failing. Samples are removed from both GenCall and Birdseed if they failed in both GenCall & Birdseed or in only one of the callsets.

3. QC of GenCall data. Calculate the Hardy Weinberg Equilibrium p-value for each snp and removing snps with p < 1e-6 (or user-defined threshold by --hwe). Calculate the call rate per SNP and remove SNPs with call rate less than 97% (or user-defined threshold by --geno).

4. QC of Birdseed data. Same as for GenCall data.

5. Merge of QC'd GenCall & QC'd Birdseed data using merge-mode 1 in PLINK.

6. QC of Merged data. Calculate the Hardy Weinberg Equilibrium p-value for each snp and removing snps with p < 1e-6 (or user-defined threshold by --hwe). Calculate the call rate per SNP and remove SNPs with call rate less than 97% (or user-defined threshold by --geno). Calculate the post-merge sample call rate and remove any samples with call rate less than 95% (or user-defined threshold by --mind).

7. Determine which SNPs in the QC'd merged GenCall & Birdseed dataset are rare (default MAF < 1% or user-defined by --maf) and extract these SNPs from the zCall dataset.

8. QC of rare subset of zCall data. Calculate the Hardy Weinberg Equilibrium p-value for each snp and removing snps with p < 1e-6 (or user-defined threshold by --hwe). Calculate the call rate per SNP and remove SNPs with call rate less than 97% (or user-defined threshold by --geno). Calculate the minor allele frequency of each SNP. If the minor allele frequency after zCall is greater than 5% (or user-defined threshold by --maxmaf) after zCall, then the merged Birdseed & GenCalls (and not the zCalls) are used for that SNP.

6. Merge common variants from the QC'd GenCall+Birdseed data (MAF > 5% or user-defined by --maf) with the QC'd set of zCall data (rare SNPs only).

Input Requirements

1. Binary PLINK files for each calling algorithm per dataset. For example, if you have 2 datasets and have GenCall, Birdseed, and zCall genotype calls, you should have 3x2 or 6 files in total.

2. Each binary PLINK file must have some identifier of the calling algorithm used to generate the calls in that file. Acceptable identifiers are as follows (not case sensitive):

Birdseed: *bird* or *BScall*

GenCall: *auto* or *gen* or *bead*

zCall: *zcall*

3. Make sure the file naming system is consistent for each set of PLINK files such that you can isolate all calls for the same dataset using a wildcard character in place of the calling algorithm name.

Example (myproject.*):

myproject.birdseed.[bed,bim,fam]

myproject.zcall.[bed,bim,fam]

myproject.gencall.[bed,bim,fam]

Usage Instructions

1. Make a directory and copy the binary PLINK files for each genotype calling algorithm you have available into the directory.

mkdir merge_data/

cp myproject1.gencall.* merge_data/

cp myproject1.zcall.* merge_data/

cp myproject1.birdseed.* merge_data/

You can also use symbolic links to the original file as well.

2. Change into the directory you created.

cd merge_data/

3. Run the following command:

mergecall_8 myproject1.*.* --out project1_merge --pheno my.pheno

--out specifies the output file root.

--pheno specifies the phenotype file (FID, IID, phenotype)

For a complete list of options, see here.

Make sure the output file root doesn't conflict with the input file wildcard you specified in case you run the script again. To check what files are input into the script, use the following command:

ls myproject.*.*

Only one dataset can be merged at a time using this script as specified by the file root with a wildcard in it.

4. Look at the output files and make sure everything looks correct.

Rerun the script with different parameters and a different output file root name if needed.

List of Options

Output Files

out: output root specified by --out

date: date in the format of year_month_day

time: time in the format of hour_minute_second

Debugging Tips

FAQ