Post date: Aug 04, 2015 5:11:0 PM
We summarized the number of haplotypes per locus (cluster) and individual based on the consensus* sequence files. Here "haplotypes" really just means unique sequences, without accounting for quality scores. We would expect a greater number of individual x locus combinations with 2+ haplotypes for triploids than for diploids.
We extracted the number of haplotypes per locus x individual separately for diploids and triploids with the Perl script getOverallLocusDepth.pl (the script must be edited to specify the outfile):
perl getOverallLocusDepth.pl consensus_*D.fasta
perl getOverallLocusDepth.pl consensus_*T.fasta
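For readers without the Perl script, the counting step can be sketched in Python. This is only an illustration, not the code in getOverallLocusDepth.pl. It assumes that each FASTA file holds one individual and that the first whitespace-delimited token of each header is the locus ID; both are assumptions about the consensus file layout, and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def haplotypes_per_locus(records):
    """Map locus ID -> number of unique sequences for one individual.

    ASSUMPTION: the locus ID is the first token of the FASTA header.
    Quality scores are ignored; sequences are compared as plain strings.
    """
    seqs = defaultdict(set)
    for header, seq in records:
        locus = header.split()[0]
        seqs[locus].add(seq.upper())
    return {locus: len(unique) for locus, unique in seqs.items()}


def haplotype_histogram(individuals):
    """Count locus x individual combinations by number of haplotypes.

    `individuals` is an iterable with one iterable of (header, sequence)
    records per individual; counts are summed across individuals.
    """
    hist = Counter()
    for records in individuals:
        hist.update(haplotypes_per_locus(records).values())
    return dict(sorted(hist.items()))
```

To use it on real files, open each consensus file and pass `parse_fasta(handle)` for every individual in a ploidy group to `haplotype_histogram`, which returns a mapping from number of haplotypes to number of locus x individual combinations.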
These commands generate the histogram files [dip|trip]DepthHist.txt, where the first column gives the number of haplotypes and the second gives the number of loci (summed across individuals) with that number of haplotypes. We read these into R and examined and compared the distributions. We see more loci with 1 than 2 haplotypes, more with 2 than 3, etc., though there is a long tail to this distribution. There is a slight deficit of one-haplotype loci for triploids compared to diploids (difference in proportion = -0.026) and a slight excess of 2 (0.004), 3 (0.011), etc. haplotype loci. The signal is not as strong as I had hoped, but it seems realistic and is probably (hopefully) strong enough to justify developing a formal model to parse diploids from triploids. This is what we will do next.
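The comparison of the two distributions was done in R; the same arithmetic can be sketched in Python as follows. This is a minimal illustration, assuming the histogram files are whitespace-delimited with no header row (an assumption about the file layout) and that differences are taken as triploid minus diploid, which matches the negative value reported for one-haplotype loci. The function names are hypothetical.

```python
def read_hist(lines):
    """Parse 'n_haplotypes  n_loci' rows into a dict.

    ASSUMPTION: two whitespace-delimited integer columns, no header row.
    """
    hist = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        hist[int(fields[0])] = int(fields[1])
    return hist


def proportions(hist):
    """Convert counts to proportions of all locus x individual combinations."""
    total = sum(hist.values())
    return {k: v / total for k, v in hist.items()}


def proportion_diff(trip_hist, dip_hist):
    """Triploid minus diploid proportion for each haplotype count."""
    p_trip, p_dip = proportions(trip_hist), proportions(dip_hist)
    keys = sorted(set(p_trip) | set(p_dip))
    return {k: p_trip.get(k, 0.0) - p_dip.get(k, 0.0) for k in keys}
```

With real data, one would call `read_hist(open("dipDepthHist.txt"))` and `read_hist(open("tripDepthHist.txt"))` and pass the results to `proportion_diff`. A negative value for key 1 and positive values for keys 2 and above would correspond to the pattern described above.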