Post date: Dec 14, 2015 4:41:54 AM
Variant calling: Here our goal is to obtain a set of high-quality SNPs that can be used for ploidy inference. Standard methods, such as the Bayesian variant callers in samtools or GATK, can be used for this. At this stage, a single ploidy must be assumed for all individuals.
Variant filtering: The initial variant set is cleaned up based on coverage and related quality criteria, using vcfFilter.pl.
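The actual filtering is done by vcfFilter.pl; as a rough illustration of the coverage-based step, here is a Python sketch that drops variants below a minimum total depth. The DP-based rule and the threshold are assumptions for illustration, not the script's actual criteria.

```python
# Illustrative coverage filter for VCF lines. The real pipeline uses
# vcfFilter.pl, whose criteria may differ; DP parsing and the depth
# threshold here are assumptions.

def depth_of(info_field):
    """Parse the total depth (DP) entry from a VCF INFO string."""
    for entry in info_field.split(";"):
        if entry.startswith("DP="):
            return int(entry[3:])
    return 0

def filter_vcf_lines(lines, min_depth=10):
    """Keep header lines and variant lines with DP >= min_depth."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)           # always keep header lines
        else:
            info = line.split("\t")[7]  # INFO is the 8th VCF column
            if depth_of(info) >= min_depth:
                kept.append(line)
    return kept
```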
Calculate SNP-based variables: At present we use four SNP-based metrics for ploidy inference: heterozygosity (the proportion of SNPs at which the individual is heterozygous) and the proportions of heterozygous SNPs that fall into each of three relative allele depth classes. We might alter the latter to make it more general. At present this is all done in R scripts, but it could easily be integrated into other scripts.
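These metrics are computed in R in the pipeline; the Python sketch below shows the logic. The class boundaries (thirds of the reference-read fraction at heterozygous SNPs) are an illustrative assumption, not the pipeline's actual definition.

```python
def snp_metrics(genotypes, depths):
    """Heterozygosity and allele-depth-class proportions for one individual.

    genotypes: list of alt-allele counts (0/1/2) per SNP
    depths: list of (ref_reads, alt_reads) per SNP
    Returns (heterozygosity, [p_low, p_mid, p_high]).
    Class cut points at 1/3 and 2/3 of the reference-read fraction
    are assumptions for illustration.
    """
    het_idx = [i for i, g in enumerate(genotypes) if g == 1]
    het = len(het_idx) / len(genotypes)
    counts = [0, 0, 0]
    for i in het_idx:
        ref, alt = depths[i]
        frac = ref / (ref + alt)   # relative depth of the reference allele
        if frac < 1 / 3:
            counts[0] += 1
        elif frac <= 2 / 3:
            counts[1] += 1
        else:
            counts[2] += 1
    n = len(het_idx) if het_idx else 1
    return het, [c / n for c in counts]
```

For a diploid, heterozygous SNPs should cluster near a 1:1 read ratio (the middle class); a triploid's 1:2 and 2:1 ratios shift mass into the outer classes, which is what makes these proportions informative about ploidy.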
Define haplotype loci and alleles: This is the first step toward obtaining haplotype data. We first grab haplotype start positions with grabStarts.pl, which works on sam alignment files. We then use a series of perl scripts to extract the genetic data for these haplotype loci from the sam files. First, combineHapLocusSnps.pl generates snpsPerHapLocus.txt, which lists the scaffold and start position of each haplotype locus, followed by the position and allele frequency of each SNP. We then grab the subset of haplotype loci with 2-4 SNPs using calcSnpsPerLocus.pl. Finally, we run extractGeneticData.pl to extract haplotypes and quality scores for each individual and read contained within a haplotype locus. The key output is hapdata.txt.
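The selection step done by calcSnpsPerLocus.pl can be sketched as follows in Python. The exact layout of snpsPerHapLocus.txt beyond the fields described above (whitespace separation, pos:freq tokens) is an assumption for illustration.

```python
def select_hap_loci(lines, min_snps=2, max_snps=4):
    """Keep haplotype loci with 2-4 SNPs, as calcSnpsPerLocus.pl does.

    Assumed record layout (whitespace-separated): scaffold, locus
    start, then one pos:freq token per SNP. The real file format may
    differ; only the listed fields come from the description above.
    """
    kept = []
    for line in lines:
        fields = line.split()
        scaffold, start, snps = fields[0], fields[1], fields[2:]
        if min_snps <= len(snps) <= max_snps:
            kept.append((scaffold, int(start), len(snps)))
    return kept
```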
Calculate haplotype-based variables: We have used two haplotype-based variables so far: the proportions of haplotype loci at which an individual likely has 2+ or 3+ haplotype alleles. This is also done with R scripts but could easily be integrated into the pipeline's perl scripts.
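A Python sketch of these two variables, again standing in for the R scripts: here an allele is counted as "likely" present when it is seen in at least a minimum number of reads, which is a simplifying assumption (the pipeline works from haplotypes and quality scores in hapdata.txt).

```python
from collections import Counter

def hap_variables(loci_reads, min_reads=2):
    """Proportions of loci with 2+ and 3+ likely haplotype alleles.

    loci_reads: per-locus lists of haplotype strings observed in one
    individual's reads. The min_reads evidence threshold is an
    illustrative assumption, not the pipeline's actual rule.
    """
    n2 = n3 = 0
    for reads in loci_reads:
        n_alleles = sum(1 for c in Counter(reads).values() if c >= min_reads)
        if n_alleles >= 2:
            n2 += 1
        if n_alleles >= 3:
            n3 += 1
    total = len(loci_reads)
    return n2 / total, n3 / total
```

Observing 3+ alleles at a locus is direct evidence against diploidy, which is why this variable complements the SNP-based metrics.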
Calculate ploidy probabilities: First, perform PCA on the SNP- and haplotype-based variables. If a training set is available, go directly to LDA; if not, first use k-means clustering to develop a training set.
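The k-means step for building a provisional training set can be sketched in pure Python as below (the pipeline does PCA, k-means, and LDA in R; scikit-learn's PCA, KMeans, and LinearDiscriminantAnalysis are equivalent choices). Initializing the two centers from the first and last individuals is a simplification for determinism.

```python
def kmeans(points, k=2, iters=50):
    """Plain k-means on individuals' (PCA-reduced) metric vectors,
    used to assign provisional ploidy labels when no training set
    exists. Initialization from the first and last points is a
    simplifying assumption.
    """
    centers = [points[0], points[-1]] if k == 2 else list(points[:k])

    def nearest(p):
        # index of the closest center by squared Euclidean distance
        return min(range(len(centers)),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))

    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p)].append(p)
        new_centers = [
            tuple(sum(p[d] for p in cl) / len(cl) for d in range(len(cl[0])))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:   # converged
            break
        centers = new_centers
    return [nearest(p) for p in points], centers
```

The resulting cluster labels serve as the provisional training set for LDA, which then yields posterior ploidy probabilities for each individual.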