Post date: Aug 04, 2015 3:3:37 PM
We are using vsearch to cluster sequences within individuals and count the number of 'haplotypes' at each locus for each individual. These counts will include haplotypes created by sequencing errors, so we are really only interested in the relative number of haplotypes in known diploids vs. triploids. If this looks promising, we will follow up with a model-based analysis to get more accurate estimates of haplotype number.
First we clustered identical sequences within individuals. We are working in /labs/evolution/data/aspen/gbs/Parsed/. We ran the script perl ../Scripts/wrap_qsub_rc_uclust1.pl *_[TDN].fastq, which first converts the fastq files to fasta files and then clusters at an identity of 1.0 (here are example commands for one individual):
cd /labs/evolution/data/aspen/gbs/Parsed/
perl ./fastq2fasta.pl USF_3133_D.fastq
~/bin/vsearch-1.0.7-linux-x86_64 --cluster_fast USF_3133_D.fasta --threads 4 --id 1.0 --centroids centroidsUSF_3133_D.fasta
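The fastq2fasta.pl script is not reproduced here. As a rough sketch of what this kind of conversion does, and of how the resulting centroids could be counted, the shell functions below assume plain four-line FASTQ records (no wrapped sequence lines). The function names fastq_to_fasta and count_records are hypothetical and are not part of our scripts.

```shell
# Sketch only: assumes each FASTQ record is exactly four lines
# (header, sequence, '+', qualities). Function names are hypothetical.

# Print a FASTA version of a FASTQ file to stdout.
fastq_to_fasta() {
    # Line 1 of each record: replace the leading '@' with '>'.
    # Line 2 of each record: the sequence, printed unchanged.
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$1"
}

# Count FASTA records, i.e. the number of header lines starting with '>'.
count_records() {
    grep -c '^>' "$1"
}
```

For example, count_records centroidsUSF_3133_D.fasta would print the number of unique sequences (centroids at an identity of 1.0) for that individual.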
Next we clustered the resulting centroids at 92% sequence similarity to group sequences by locus (or what we hope are loci, anyway). We may need to adjust this threshold, as aspen has had a relatively recent genome duplication (about 10 million years before present). We ran the script perl ../Scripts/wrap_qsub_rc_uclust2.pl centroids* for this (here is an example command):
cd /labs/evolution/data/aspen/gbs/Parsed/
~/bin/vsearch-1.0.7-linux-x86_64 --cluster_fast centroidsUSF_1947_D.fasta --threads 4 --id 0.92 --centroids contig_centroidsUSF_1947_D.fasta --consout consensus_centroidsUSF_1947_D.fasta
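The command above writes centroids and consensus sequences but not cluster sizes. One possible way to get the number of haplotypes per locus, sketched here as an assumption rather than as what our scripts do, is to add vsearch's --uc option to the 92% clustering command so that it also writes a UCLUST-style tab-separated table. In that format, each 'C' record summarizes one cluster: column 3 is the cluster size and column 9 is the centroid label. Because the input is already collapsed to unique sequences, the cluster size is the number of distinct sequences grouped at that putative locus. The function name haplotypes_per_locus is hypothetical.

```shell
# Sketch only: assumes a UCLUST-format table produced by adding
# --uc <file> to the 92% clustering command (an assumption; the
# original command does not write this file).

# Print "centroid_label<TAB>number_of_sequences" for each cluster.
haplotypes_per_locus() {
    # 'C' records are per-cluster summaries:
    # column 9 = centroid label, column 3 = cluster size.
    awk -F'\t' '$1 == "C" { print $9 "\t" $3 }' "$1"
}
```

For example, if the table were written to a file named USF_1947_D.uc (a hypothetical name), haplotypes_per_locus USF_1947_D.uc would list one line per putative locus with its haplotype count, which could then be compared between diploids and triploids.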