LDIV: IMPROVED GENOTYPE CALLING

LDIV is an accurate and fast genotype calling method for Next Generation Sequencing (NGS) data.

LDIV enhances genotype imputation and phasing in both family-based and general-population sequencing data by leveraging identity-by-descent (IBD) and linkage disequilibrium (LD) information.

RELEASE NOTE & DOWNLOAD

Current version is 1.2.4

v1.2.4 Handles mixed cohorts that contain both family members and unrelated individuals.

v1.2.3 Stable version; fixes the free empty pointer issue.

To download and use LDIV, before its public release, please contact the author at fangz[dot]ark[at]gmail[dot]com

Where Should LDIV be Used in an NGS Pipeline?

We take the GATK pipeline as a typical example of an NGS data analysis pipeline.

LDIV takes genotype calls from other software (e.g. GATK, Beagle4) in a vcf file and recalls the genotypes to improve genotyping accuracy. The main input and output are both vcf files (shown above on the right panel).

GETTING STARTED

A few simple steps to get LDIV running on a UNIX-based server with basic commands.

First, you should prepare your input files as below (available from the output of conventional tools like GATK):

INPUT & OUTPUT

Basic input files of LDIV

vcf file, from general NGS pipeline e.g. GATK.

ped file, in PLINK format.

map file, in PLINK format.

Recommended optional input files

refvcf file, reference panel in vcf format, usually containing deeply sequenced samples from the population.

map file, in PLINK format.

Basic Output files

vcf file: one file, or several if the multiple imputation parameter is on.

Then, to test whether LDIV is installed and working on your machine, take a smaller sample (or subsample your original vcf files, e.g. to the top 1000 SNPs) and run this test script:

A SIMPLE EXAMPLE SCRIPT

Running LDIV with basic commands (test input could be subsampled from your own data, or download and unzip the test_data.zip):

./LDIV --vcf $PATH/test.vcf --ped $PATH/test.ped --prefix $PATH/nameOfOutput --ldiv --states 10 --rounds 10

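where $PATH stands for the directory that holds the downloaded test data, e.g. ~/LDIVtest/. Note that PATH is also the shell's own executable search path, so substitute the directory directly rather than relying on that variable.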
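If you build the test input from your own data, one way to keep only the first SNPs is a small awk filter. The sketch below is not part of LDIV: the function name subsample_vcf and the file names are hypothetical, and it assumes an uncompressed vcf whose header lines all start with '#'.

```shell
# Hypothetical helper (not shipped with LDIV): keep every header line
# plus the first N variant records of an uncompressed VCF.
subsample_vcf() {
    in_vcf=$1    # source VCF
    n=$2         # number of variant records to keep
    out_vcf=$3   # subsampled VCF to write
    awk -v n="$n" '
        /^#/      { print; next }   # header lines are always kept
        seen >= n { exit }          # stop once N records were written
                  { print; seen++ }
    ' "$in_vcf" > "$out_vcf"
}

# Example (file names are placeholders):
# subsample_vcf mydata.vcf 1000 test.vcf
```

The ped file needs no trimming for this kind of test, since subsampling SNPs leaves the set of individuals unchanged.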

Running LDIV with reference panels:

./LDIV --vcf $PATH/test.vcf --ped $PATH/test.ped --map $PATH/test.map --ref $PATH/test_ref.vcf --prefix $PATH/nameOfOutput --ldiv --states 10 --rounds 10

Please note that this example is for quick and convenient testing only. For real analyses we recommend setting states to 20 or higher, depending on the available computational power.

More options and parameters are listed in the USER MANUAL section.

If the previous step runs smoothly and produces the output described in the INPUT & OUTPUT section, you can in most cases go ahead with your analysis, using options tailored to your needs.

The following sections describe situations in which LDIV may run into problems, and provide pipelines and commands to help you solve them:

Split VCF by chromosome before running LDIV (optional)

Note that, like other computational genetics tools, LDIV expects each input vcf file and each reference vcf file to contain only one chromosome, which makes it easy to run all chromosomes in parallel.

Please split the vcf by chromosome before running LDIV. Common tools and simple scripts can be used for this purpose, for example:
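To check whether a file already satisfies this, you can list the chromosome names it contains. The helper below is a minimal sketch, not part of LDIV; the name vcf_chromosomes is hypothetical and it assumes an uncompressed, tab-separated vcf.

```shell
# Hypothetical helper: print each distinct chromosome name found in the
# first column of an uncompressed VCF, in order of first appearance.
vcf_chromosomes() {
    awk -F'\t' '!/^#/ && !seen[$1]++ { print $1 }' "$1"
}

# Example (file name is a placeholder):
# vcf_chromosomes myvcf.vcf
```

A file that prints exactly one name does not need to be split.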

Option1 (vcftools)

seq 1 22 | parallel "vcftools --recode --gzvcf 1000genomes.vcf.gz --chr {} --out chr{}"

Option2 (tabix)

bgzip -c myvcf.vcf > myvcf.vcf.gz

tabix -p vcf myvcf.vcf.gz

tabix -h myvcf.vcf.gz chr1 > chr1.vcf
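If neither vcftools nor tabix is available, plain awk can do the same job for an uncompressed vcf. This is only a sketch under that assumption; the function name split_vcf_by_chrom and the output naming (prefix followed by the chromosome name) are hypothetical choices, not LDIV conventions.

```shell
# Hypothetical helper: write one VCF per chromosome, each starting with
# a copy of the original header. Output files are named
# <out_prefix><chromosome>.vcf.
split_vcf_by_chrom() {
    in_vcf=$1
    out_prefix=$2
    awk -F'\t' -v prefix="$out_prefix" '
        /^#/ { header = header $0 "\n"; next }   # collect header lines
        {
            out = prefix $1 ".vcf"
            if (!($1 in seen)) {                 # first record of this chromosome
                printf "%s", header > out
                seen[$1] = 1
            }
            print > out
        }
    ' "$in_vcf"
}

# Example (file names are placeholders):
# split_vcf_by_chrom myvcf.vcf split_    # writes split_chr1.vcf, split_chr2.vcf, ...
```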

Split VCF file in case of memory shortage (optional)

LDIV's memory usage is reasonable for most modern servers. Occasionally, however, memory allocation can fail on a machine with limited memory when LDIV is run on whole genome sequencing data (e.g. >50k SNPs per chromosome) with a large reference file (>2k reference individuals). In that case, instead of running LDIV directly, please run the attached LDIVpipeline.sh with the same set of input arguments as LDIV.

For example, the simple example above becomes:

./LDIVpipeline.sh --vcf=$PATH/test.vcf --ped=$PATH/test.ped --prefix=$PATH/nameOfOutput --ldiv --states=10 --rounds=10 --split=10000 --src=$PATHtoSRC

A few things to note when using LDIVpipeline.sh instead of LDIV:

Use an equals sign instead of a space to separate each option from its value.

--split The maximum number of SNPs allowed in each sub vcf file split from the input vcf.

--src The path to the LDIV executable.
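To illustrate what a --split limit means in practice, the sketch below cuts a vcf into consecutive pieces of at most a given number of SNPs, each with its own copy of the header. It is not the code of LDIVpipeline.sh, whose internals are not described here; the function name split_vcf_by_count and the .partNNN.vcf naming are hypothetical, and the limit is assumed to be at least 1.

```shell
# Hypothetical illustration (not LDIVpipeline.sh itself): split an
# uncompressed VCF into chunks of at most max_snps variant records.
# Chunks are written as <out_prefix>.part001.vcf, .part002.vcf, ...
split_vcf_by_count() {
    in_vcf=$1
    max_snps=$2     # maximum records per chunk, must be >= 1
    out_prefix=$3
    awk -v max="$max_snps" -v prefix="$out_prefix" '
        /^#/ { header = header $0 "\n"; next }   # collect header lines
        (n % max) == 0 {                         # time to start a new chunk
            if (out != "") close(out)
            out = sprintf("%s.part%03d.vcf", prefix, ++part)
            printf "%s", header > out
        }
        { print > out; n++ }
    ' "$in_vcf"
}

# Example (file names are placeholders):
# split_vcf_by_count test.vcf 10000 test_sub
```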

Matching and trimming the reference file (optional)

When the reference vcf file is much larger than the input vcf file (the vcf file to be genotyped), trimming the reference file to the markers present in the input vcf file can save loading time.

This can be done with the attached small bash script matchNtrim.sh:

./matchNtrim.sh input.vcf ref.vcf newRef.vcf
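The idea behind this step can be sketched in a few lines of awk. The code below is not the attached matchNtrim.sh; it is a hypothetical stand-in that assumes uncompressed, tab-separated vcf files and that markers are matched on chromosome and position only (columns 1 and 2).

```shell
# Hypothetical illustration (not the attached matchNtrim.sh): keep the
# reference header plus only those reference records whose CHROM:POS
# also occurs in the input VCF.
trim_ref_to_input() {
    input_vcf=$1
    ref_vcf=$2
    out_vcf=$3
    awk -F'\t' '
        NR == FNR {                       # first file: the input VCF
            if ($0 !~ /^#/) keep[$1 ":" $2] = 1
            next
        }
        /^#/ { print; next }              # second file: keep reference header
        { key = $1 ":" $2; if (key in keep) print }
    ' "$input_vcf" "$ref_vcf" > "$out_vcf"
}

# Example (file names are placeholders):
# trim_ref_to_input input.vcf ref.vcf newRef.vcf
```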

USER MANUAL

The basic options are as follows:

--vcf [ ] Standard VCF file (4.0 and above).

--ped [ ] Pedigree file in PLINK format.

--map [ ] Map file in PLINK format.

--prefix [ ] The prefix of the output file.

--states [ ] The number of haplotypes used in the state space. The default is the maximum number.

The widest allowed range of #states is [2, 2*(#Founders-1)]. We nevertheless strongly suggest setting #states > 10 for LDIV and FamLDCaller whenever possible. If there are too few founders, we suggest using reference haplotypes (with --refvcf) or the --polymutt option.

--rounds [ ] The total number of iterations.

--refvcf [ ] Reference panel in standard VCF file format (4.0 and above) to facilitate genotype calling, especially for common variants. It can often be downloaded from a public genotyping project, e.g. the 1000G project.

--ldiv Use LDIV for genotype calling, with automatic selection of the calling method. Default [on].

--famldcaller Force the use of FamLDCaller⁶ for genotype calling. Default [off].

--polymutt Force the use of Polymutt2⁴ for genotype calling. Default [off].

--nthreads [ ] Number of threads used for parallel computing.

Other advanced options are listed when you run LDIV with no arguments, i.e.: ./LDIV
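The upper bound 2*(#Founders-1) given for --states can be computed from the ped file before a run. The sketch below is not part of LDIV: the function names are hypothetical, and it assumes the usual PLINK convention that a founder has both parental IDs (columns 3 and 4) set to 0. It counts founders in the ped file only and ignores any reference haplotypes.

```shell
# Hypothetical helpers: count founders in a PLINK ped file and derive
# the upper bound 2*(#Founders-1) quoted for --states.
count_founders() {
    # A founder is assumed to have paternal and maternal ID both "0".
    awk '$3 == "0" && $4 == "0" { n++ } END { print n + 0 }' "$1"
}

max_states() {
    founders=$(count_founders "$1")
    echo $(( 2 * (founders - 1) ))
}

# Example (file name is a placeholder):
# max_states test.ped
```

If the printed bound is at or below 10, the --states advice above points toward adding a reference panel or using --polymutt.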

INTRODUCTION TO GENOTYPING

Here we present the two perspectives on genotype calling that our method is based on:

Using linkage disequilibrium (LD) information, by viewing each haplotype as a mosaic of other haplotypes¹,²,³
Using identity-by-descent (IBD) information, by constructing inheritance vectors within a family⁴

Fig. 1 Simple illustration of haplotype mosaics (left) and IBD in a pedigree⁵ (right).

REFERENCES

1. Li N, Stephens M. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics. 2003;165(4):2213–33.
2. Chen W, Li B, Zeng Z, Sanna S, Sidore C, Busonero F, et al. Genotype calling and haplotyping in parent-offspring trios. Genome Res. 2013;23(1):142–51.
3. Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 2011;21(6):940–51.
4. Li B, et al. Leveraging Identity-by-Descent for Accurate Genotype Inference in Family Sequencing Data. PLoS Genet. 2015;11:e1005271.
5. International Society of Genetic Genealogy Wiki.
6. Chang LC, Li B, Fang Z, Vrieze S, McGue M, Iacono WG, Tseng GC, Chen W. A computational method for genotype calling and haplotyping for family-based sequence data. BMC Bioinformatics. 2016;17:37.

GRANT R01HG007358, R01HG006857
