Post date: Aug 27, 2015 3:26:48 PM
We defined haplotype loci as follows:
start from the left-most read start point on a scaffold; this is the start point for the first locus
proceed one (sorted, unique) read at a time: if the read's start is within $nbp = 100 bp of the defined start of the current locus, the read is part of the same locus; if not, the start point of the read defines the start of the next locus
continue until the end of the scaffold, and then repeat for the next scaffold
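The steps above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the code in grabStarts.pl. The function name define_loci and the input format (scaffold, start pairs) are made up for the example, and it assumes that "within 100" means a distance of 100 bp or less from the locus start.

```python
from collections import defaultdict


def define_loci(read_starts, nbp=100):
    """Group read start points into haplotype loci.

    read_starts: iterable of (scaffold, start) pairs (hypothetical input format).
    Returns a list of (scaffold, locus_start) pairs, one per locus.
    """
    # Collect the unique start points for each scaffold.
    by_scaffold = defaultdict(set)
    for scaffold, start in read_starts:
        by_scaffold[scaffold].add(start)

    loci = []
    # Scaffolds are visited in name order here; the real script may differ.
    for scaffold in sorted(by_scaffold):
        locus_start = None
        for start in sorted(by_scaffold[scaffold]):
            # Assumption: "within nbp" is inclusive (distance <= nbp).
            if locus_start is None or start - locus_start > nbp:
                locus_start = start
                loci.append((scaffold, locus_start))
    return loci


example = [("s1", 10), ("s1", 50), ("s1", 110), ("s1", 111),
           ("s1", 300), ("s2", 5), ("s1", 10)]
example_loci = define_loci(example)
```

Note that distance is always measured from the defined start of the locus, not from the previous read, so a locus never spans more than nbp bases of start points.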
The script for this is grabStarts.pl (/labs/evolution/data/aspen/gbs/Assemblies/Scripts), and it writes the outfile hapLocusStarts.txt. The outfile has two columns, one with the scaffold and one with the start, with one haplotype locus per row. Here is the command we used to run it on the aspen data:
perl Scripts/grabStarts.pl aln*sam
This generated 303,667 potential haplotype loci.
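For readers who want to see where the read start points come from, here is a rough Python sketch of pulling (scaffold, start) pairs out of SAM alignment lines. It is not taken from grabStarts.pl. It assumes the start point is the SAM POS field (the 1-based left-most mapped position) and that unmapped reads are skipped; the function name sam_read_starts is hypothetical.

```python
def sam_read_starts(lines):
    """Yield (scaffold, start) for each mapped read in an iterable of SAM lines."""
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # not a complete alignment record
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        # Assumption: skip unmapped reads (FLAG bit 0x4 set, or no reference name).
        if flag & 0x4 or rname == "*":
            continue
        yield rname, pos


sample_lines = [
    "@SQ\tSN:scaffold_1\tLN:5000\n",
    "read1\t0\tscaffold_1\t101\t60\t50M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
sample_starts = list(sam_read_starts(sample_lines))
```

With real data the lines would come from the open aln*sam files rather than from a list.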