Post date: Aug 24, 2015 4:30:42 PM
We first need to define loci for calling haplotypes. It looks like this can mostly be done by treating a locus as all reads that start within 100 bp of each other.
So, here is the initial plan:
grab all of the unique start points (which will be a mix of starts and stops)
find the outer bounds of each set in which all unique starts are within 100 bp of each other
the outer bounds will delineate the loci for haplotype calling
we will extract the variable sites from each read, along with quality scores for all SNPs within a haplotype, filling in with Ns wherever a read has no data (give these the worst possible quality score = all bases equally likely, or 100% chance of error)
this will be the input for our model
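A rough sketch of the plan above, in Python. Everything named here (`define_loci`, `haplotype_row`, `max_span`, `missing_qual`) is made up for illustration and is not existing code. Two assumptions to flag: "within 100 bp of each other" is read as every position in a locus lying within 100 bp of the leftmost position of that locus (chaining neighbors that are each within 100 bp would give wider loci), and quality scores are taken to be Phred-scaled, so a missing call gets Q = 0, which corresponds to an error probability of 1.

```python
def define_loci(positions, max_span=100):
    """Group read start/stop positions into loci.

    Assumption: a position joins the current locus only if it lies within
    max_span bp of that locus's leftmost position. Returns a list of
    (lower_bound, upper_bound) tuples, one per locus.
    """
    loci = []
    for pos in sorted(set(positions)):  # unique positions, left to right
        if loci and pos - loci[-1][0] <= max_span:
            loci[-1][1] = pos           # extend the current locus
        else:
            loci.append([pos, pos])     # open a new locus
    return [tuple(bounds) for bounds in loci]


def haplotype_row(read_calls, sites, missing_qual=0):
    """Build one read's haplotype over the variable sites of a locus.

    read_calls maps site position -> (base, quality) for the sites the read
    covers. Uncovered sites are filled with 'N' and missing_qual
    (assumed Phred 0, i.e. no information).
    """
    bases, quals = [], []
    for site in sites:
        base, qual = read_calls.get(site, ("N", missing_qual))
        bases.append(base)
        quals.append(qual)
    return "".join(bases), quals


if __name__ == "__main__":
    print(define_loci([5, 40, 105, 300, 350, 350, 1000]))
    print(haplotype_row({12: ("A", 30), 57: ("T", 20)}, [12, 33, 57]))
```

The outer bounds returned by `define_loci` would delineate the loci, and one `haplotype_row` per read would make up the model input for a locus.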