Post date: Aug 30, 2013 8:3:44 PM
Here are my notes for the de novo portion of the assembly. You can see more details by looking in the scripts and files that I mention.
First, I am going to perform a de novo assembly with a subset of the sequence data to build a new pseudo-reference chromosome.
The de novo assembly used 40 million sequences (the first 5 million from each of the eight lanes). The sequences are at /data/local/lycaeides_gbs/Assemblies/gbs_de_novo/lycaeides_sub_lanes1to8.fastq.
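The subsetting step can be sketched roughly as follows. This is a minimal Python sketch, not the procedure that was actually used to build the file above. It assumes uncompressed FASTQ files with plain four-line records, and the function name take_first_reads and the per-lane file names in the usage note are hypothetical.

```python
from itertools import islice


def take_first_reads(lane_paths, out_path, reads_per_lane):
    """Write the first `reads_per_lane` FASTQ records of each lane to out_path.

    Assumes plain four-line FASTQ records (no wrapped sequence lines).
    Returns the total number of complete records written.
    """
    total = 0
    with open(out_path, "w") as out:
        for path in lane_paths:
            n_lines = 0
            with open(path) as lane:
                # Each record is 4 lines: header, sequence, '+', qualities.
                for line in islice(lane, 4 * reads_per_lane):
                    out.write(line)
                    n_lines += 1
            total += n_lines // 4
    return total
```

With eight hypothetical lane files named lane1.fastq through lane8.fastq, a call such as take_first_reads([f"lane{i}.fastq" for i in range(1, 9)], "lycaeides_sub_lanes1to8.fastq", 5_000_000) would produce a 40 million read subset, provided every lane holds at least 5 million reads.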
The de novo assembly script was lycaeides_40mil_denovo.smng.txt. I used a minimum match percentage of 93%, match size of 70, match spacing of 100, and a minimum of 10 sequences to retain a contig. Other parameters were similar to those that I used for the Lycaeides Tetons de novo assembly. The assembly was written to /data/local/lycaeides_gbs/Assemblies/gbs_de_novo/lycaeides_40mil_denovo_mmp93ms100.ace.
GBS-based reference for Lycaeides
21.3 million of the 40 million sequences assembled into 283,831 contigs, with a contig N50 of 87 bases. I selected the consensus sequences of the subset of these contigs that were 82 to 92 bases in length and contained only forward sequences. I used the pruneContigs.pl perl script from /data/local/lycaeides_gbs/Scripts/ for this task. This retained 269,962 contigs; the consensus sequences are in /data/local/lycaeides_gbs/Assemblies/gbs_de_novo/pruned_lycaeides_40mil_denovo_mmp93ms100.fasta. I then assembled these sequences against themselves to identify similar, potentially repetitive contigs. I performed this assembly with SeqMan NGen (smng) with a minimum match percentage of 80%. 252,014 contigs did not assemble (these are the good ones) and were written to the file lycaeides_40mil_contigs.fastq. Next I need to blast these sequences against a Wolbachia genome database and split the raw sequence data into individual fastq files.
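For readers who want to see how the contig N50 and the length window work, here is a minimal Python sketch. It is not pruneContigs.pl, whose contents are not shown in these notes. It only covers the N50 calculation and the 82 to 92 base length filter on a FASTA file; the forward-orientation criterion is left out because it depends on read orientation recorded in the ACE file. The function names are hypothetical.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together hold at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=82, max_len=92):
    """Keep records whose sequence length lies in [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]
```

Both bounds are inclusive, matching the "82 to 92 bases" wording above; whether the original script treated the bounds the same way is an assumption.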
I converted the fastq contig file to a fasta file and transferred it to seismic to blast against the Wolbachia genome database I used for the hybrid zone data. This database comes from Wolbachia_NC_006833.fasta. I ran blastn with an e-value of 1e-7. The results were written to blast_wolbachia_out.
252,000 contigs did not match Wolbachia. These good contigs are in a fasta file that will function as my reference genome: lycaeides_gbscontigs_reference.fasta. I determined this by running the following scripts:
perl getBlastHits.pl blast_wolbachia_out
perl removeWolbachia.pl hits.csv lycaeides_40mil_contigs.fasta
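The removal step can be pictured with the following minimal Python sketch. It is not the removeWolbachia.pl script, which is not reproduced in these notes. It assumes that hits.csv lists one contig ID per row in its first column and that each FASTA header starts with that same ID; both are assumptions about file layout, and the function names are hypothetical.

```python
import csv


def load_hit_ids(hits_csv):
    """Collect contig IDs from the first column of a CSV file (assumed layout)."""
    with open(hits_csv, newline="") as fh:
        return {row[0].strip() for row in csv.reader(fh) if row}


def remove_hits(fasta_in, fasta_out, hit_ids):
    """Copy FASTA records whose ID is not in hit_ids; return the number kept."""
    kept = 0
    keep = False
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # The ID is taken as the first whitespace-delimited token after '>'.
                fields = line[1:].split()
                contig_id = fields[0] if fields else ""
                keep = contig_id not in hit_ids
                if keep:
                    kept += 1
            if keep:
                dst.write(line)
    return kept
```

The return value gives a quick check on how many contigs survive the filter, which can be compared against the count reported above.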