Post date: Jan 23, 2015 12:12:19 AM
Working from /home/A01963476/data/callosobruchus/gbs/Parsed
1. Split the fastq files by individual (note: I combined the two good gomp003 runs, but not the initial failed run):
perl ../Scripts/splitFastq.pl ids.txt parsed_gomp003_combined.fastq parsed_gomp004_NoIndex_L006_R1_001.fastq parsed_gomp005_NoIndex_L007_R1_001.fastq
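splitFastq.pl is a custom script that is not shown here. As an illustration only, here is a minimal sketch of the idea, assuming (these are assumptions, not the actual script) that ids.txt lists one individual ID per line and that the parsing step left each read's individual ID as the first field of the fastq header:

#!/usr/bin/perl
# Hypothetical sketch of splitFastq.pl: demultiplex parsed fastq files into
# one fastq per individual. Assumes ids.txt lists one individual ID per line
# and that each read header begins with "@<ID> ..." after parsing.
use strict;
use warnings;

my ($idfile, @fastqs) = @ARGV;

# read the list of individual IDs and open one output file per individual
open(my $ids, '<', $idfile) or die "Cannot open $idfile: $!";
my %out;
while (my $id = <$ids>) {
    chomp $id;
    next unless length $id;
    open($out{$id}, '>', "$id.fastq") or die "Cannot open $id.fastq: $!";
}
close $ids;

# stream through each parsed fastq, writing 4-line records to the matching file
foreach my $fq (@fastqs) {
    open(my $in, '<', $fq) or die "Cannot open $fq: $!";
    while (my $head = <$in>) {
        my $seq  = <$in>;
        my $plus = <$in>;
        my $qual = <$in>;
        my ($id) = $head =~ /^@(\S+)/;   # assumed header format
        next unless defined $id && exists $out{$id};
        print {$out{$id}} $head, $seq, $plus, $qual;
    }
    close $in;
}
close $_ for values %out;

The real script may handle barcodes or header formats differently.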
2. Cluster highly similar sequences together within individuals (98% identity) and output the centroids (submitted with perl ../Scripts/wrap_qsub_rc_uclust1.pl [LM]*fastq; 576 jobs, one shown here):
cd /labs/evolution/data/callosobruchus/gbs/Parsed/
~/bin/usearch -cluster_fast M-14-48.fastq -sort length -id 0.98 --centroids centroidsM-14-48.fasta
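wrap_qsub_rc_uclust1.pl is a custom wrapper that is not shown here. A minimal sketch of a wrapper of this sort, with placeholder queue settings (the actual resource requests are not recorded):

#!/usr/bin/perl
# Hypothetical sketch of wrap_qsub_rc_uclust1.pl: write and submit one qsub
# job per individual fastq file. Resource requests are placeholders.
use strict;
use warnings;

foreach my $fastq (@ARGV) {
    (my $base = $fastq) =~ s/\.fastq$//;
    my $script = "uclust_$base.sh";
    open(my $sh, '>', $script) or die "Cannot write $script: $!";
    print $sh "#!/bin/sh\n";
    print $sh "#PBS -l nodes=1:ppn=1,walltime=24:00:00\n";   # placeholder resources
    print $sh "cd /labs/evolution/data/callosobruchus/gbs/Parsed/\n";
    print $sh "~/bin/usearch -cluster_fast $fastq -sort length -id 0.98 --centroids centroids$base.fasta\n";
    close $sh;
    system("qsub $script") == 0 or warn "qsub failed for $script\n";
}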
3. Combine the centroid sequences for individuals from the same line (ideally all individuals would be clustered together, but there is not enough memory; a sketch of the combining step follows the commands below). Note: I first had to subsample 10 million sequences at random from each combined file to stay under 4 Gb (the files held between 21 and 87 million sequences). Then cluster highly similar sequences together within lines (98% identity) and output one centroid file per line (submitted with perl ../../Scripts/wrap_qsub_rc_uclust2.pl cent[LM]*; 7 jobs, one shown here):
cd /labs/evolution/data/callosobruchus/gbs/Assemblies/gbsdenovo/
~/bin/usearch -fastx_subsample centL1.fasta -sample_size 10000000 -fastaout subcentL1.fasta
~/bin/usearch -cluster_fast subcentL1.fasta -sort length -id 0.98 --centroids uniqueL1.fasta
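The concatenation of per-individual centroid files into one cent<line>.fasta per line is not shown above. Purely as an illustration, here is a sketch that assumes a hypothetical two-column mapping file (individual, tab, line); the actual step may simply have been a cat over files sharing a line prefix:

#!/usr/bin/perl
# Hypothetical sketch of the combining step: concatenate per-individual
# centroid fastas into one file per line. Assumes a two-column mapping file
# (individual<TAB>line), which is an assumption, not part of the original record.
use strict;
use warnings;

my $map = shift @ARGV;   # e.g. line_map.txt (hypothetical)
open(my $in, '<', $map) or die "Cannot open $map: $!";
my %line_of;
while (<$in>) {
    chomp;
    my ($ind, $line) = split /\t/;
    push @{ $line_of{$line} }, "centroids$ind.fasta";
}
close $in;

foreach my $line (sort keys %line_of) {
    my $files = join ' ', @{ $line_of{$line} };
    system("cat $files > cent$line.fasta") == 0
        or warn "cat failed for line $line\n";
}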
3a. Note: even with the further reduction described below, it would take too much memory to cluster all of the unique sequences across populations. vsearch (a free, open-source, recently released, 64-bit relative of usearch) might be a better option. To try this I combined the individual files (by combining the population centroid files), shuffled them, and then re-headered them so that they would sort by length but not by the original ids (a sketch of the re-headering script follows the results below). I am now clustering these (note: I still had to reduce to a random subset of 54,631,148 sequences).
~/bin/vsearch-1.0.7-linux-x86_64 --threads 16 --shuffle centAll.fasta --output shuffCentAll.fasta
perl rehead.pl shuffCentAll.fasta
~/bin/vsearch-1.0.7-linux-x86_64 --threads 24 --cluster_fast mod_shuffCentAllsub.fasta --centroids contigCentroids.fasta --consout contigConsenus.fasta --msaout contigMsa.fasta --id 0.9 --iddef 2 --mincols 64 --minsl 0.8 --minseqlength 64 --fasta_width 66
## RESULTS ##
Reading file mod_shuffCentAllsub.fasta 100%
3670949081 nt in 43243064 seqs, min 64, max 87, avg 85
WARNING: 11388084 sequences shorter than 64 nucleotides discarded.
Indexing sequences 100%
Masking 100%
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Writing clusters 100%
Clusters: 3975590 Size min 1, max 74722, avg 10.9
Singletons: 2785816, 6.4% of seqs, 70.1% of clusters
Multiple alignments 100%
vsearch v1.0.7_linux_x86_64, 1009.9GB RAM, 64 cores
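rehead.pl is a custom script that is not shown here. A minimal sketch of what it might do, assuming it only replaces each fasta header with a sequential index so that the length sort cannot be influenced by the original ids (the further subset to 54,631,148 sequences would have been a separate step, and the mod_ output prefix is inferred from the file names above):

#!/usr/bin/perl
# Hypothetical sketch of rehead.pl: replace every fasta header with a plain
# sequential index. The "mod_" output prefix is an assumption based on the
# file names above.
use strict;
use warnings;

my $infile  = shift @ARGV;               # e.g. shuffCentAll.fasta
my $outfile = "mod_$infile";
open(my $in,  '<', $infile)  or die "Cannot open $infile: $!";
open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";

my $n = 0;
while (my $line = <$in>) {
    if ($line =~ /^>/) {
        $n++;
        print $out ">seq$n\n";           # drop the original header
    } else {
        print $out $line;                # sequence lines pass through unchanged
    }
}
close $in;
close $out;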
4. Here I am clustering the consensus sequences from 3a to see whether any collapse at a lower identity (80%). Any that do will be removed before moving forward.
~/bin/vsearch-1.0.7-linux-x86_64 --threads 12 --cluster_fast contigConsenus.fasta --msaout paralogsMsa.fasta --id 0.8 --iddef 2 --mincols 64 --minsl 0.8 --minseqlength 64 --fasta_width 66
## RESULTS ##
Reading file contigConsenus.fasta 100%
339754510 nt in 3975583 seqs, min 64, max 115, avg 85
WARNING: 7 sequences shorter than 64 nucleotides discarded.
Indexing sequences 100%
Masking 100%
Sorting by length 100%
Counting unique k-mers 100%
Clustering 100%
Writing clusters 100%
Clusters: 2157425 Size min 1, max 9017, avg 1.8
Singletons: 1698565, 42.7% of seqs, 78.7% of clusters
Multiple alignments 100%
vsearch v1.0.7_linux_x86_64, 252.4GB RAM, 64 cores
https://github.com/torognes/vsearch
5. I retained the subset of distinct (non-collapsed) contigs that had at least 8 reads originally. I then converted these back to fasta format and kept only contigs with consensus sequences between 72 and 92 bp in length. In all, 93,134 contigs were retained; these are in gbs_contigs.fasta.
perl extractSingletons.pl
perl noMatch2fasta.pl
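extractSingletons.pl and noMatch2fasta.pl are custom scripts that are not shown here. As an illustration only, a minimal sketch of the final length filter (72-92 bp), assuming the read-count (at least 8 reads) and collapse filters have already been applied upstream; the file names are placeholders:

#!/usr/bin/perl
# Hypothetical sketch of the final length filter: keep contigs whose consensus
# sequence is between 72 and 92 bp. Input/output names are placeholders; the
# read-count and paralog-collapse filters are assumed to have been applied
# already by extractSingletons.pl / noMatch2fasta.pl.
use strict;
use warnings;

my ($infile, $outfile) = @ARGV;          # e.g. filtered_contigs.fasta gbs_contigs.fasta
open(my $in,  '<', $infile)  or die "Cannot open $infile: $!";
open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";

local $/ = "\n>";                        # read one fasta record at a time
while (my $rec = <$in>) {
    $rec =~ s/^>//;
    $rec =~ s/>$//;
    my ($head, @seq) = split /\n/, $rec;
    my $seq = join '', @seq;
    my $len = length $seq;
    print $out ">$head\n$seq\n" if $len >= 72 && $len <= 92;
}
close $in;
close $out;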
6. A single contig with a significant blastn hit to Wolbachia was dropped from the final file, leaving 93,133 contigs.
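The screening itself is not recorded here. As an illustration only, a minimal sketch of dropping flagged contigs from the fasta, assuming a plain list of contig IDs to remove (one per line) extracted from the blastn output; the file names are placeholders:

#!/usr/bin/perl
# Hypothetical sketch: drop contigs whose IDs appear in a blast-hit list.
# drop_ids.txt and the output redirection target are placeholders.
use strict;
use warnings;

my ($dropfile, $fasta) = @ARGV;          # e.g. drop_ids.txt gbs_contigs.fasta
open(my $dl, '<', $dropfile) or die "Cannot open $dropfile: $!";
my %drop;
while (<$dl>) { chomp; $drop{$_} = 1 if length }
close $dl;

open(my $in, '<', $fasta) or die "Cannot open $fasta: $!";
my $keep = 1;
while (my $line = <$in>) {
    if ($line =~ /^>(\S+)/) {
        $keep = exists $drop{$1} ? 0 : 1;   # decide per contig
    }
    print $line if $keep;                   # redirect stdout to the cleaned file
}
close $in;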