Post date: May 04, 2015 7:50:23 PM
Pelle shared a draft assembly for aspen (P. tremuloides) with me.
The genome assembly is derived from a tree (Dan-2) that was collected by Rick Lindroth and is included in their new P. tremuloides common garden. The P. tremuloides assembly is 377,575,534 bp in size, with an N50 of 15,222 bp. It is still highly fragmented (no real scaffolding has been done), with 164,504 scaffolds in total. The fasta file is here: rc:/labs/evolution/data/aspen/genome/Potrs01-genome.fa.
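As a quick sanity check on the reported N50, here is a minimal sketch of how it could be recomputed from the fasta file. The helper names `fasta_lengths` and `n50` are hypothetical (they are not part of any pipeline used here), and N50 is taken in the usual sense: the length of the shortest sequence among the longest sequences that together cover at least half of the assembly.

```shell
#!/bin/sh
# fasta_lengths (hypothetical helper): print one sequence length per fasta record.
# Reads the files given as arguments, or stdin if none are given.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
         { n += length($0) }
         END { if (seen) print n }' "$@"
}

# n50 (hypothetical helper): read lengths on stdin, print the N50.
n50() {
    sort -rn | awk '{ len[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                run += len[i]                      # cumulative length, longest first
                if (run >= total / 2) { print len[i]; exit }
            }
        }'
}
```

Usage would be something like `fasta_lengths Potrs01-genome.fa | n50`, run from the directory that holds the assembly.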
Karen and I generated GBS data for 192 aspen samples. The raw data, barcode files and parsed reads can be found here: rc:/labs/evolution/data/aspen/gbs/. Here is a brief summary from the parsed reads report:
Good mids count: 259091392
Bad mids count: 39457049
Number of seqs with potential MSE adapter in seq: 13255254
Seqs that were too short after removing MSE and beyond: 1627739
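To put these counts in perspective, the share of reads with a good MID can be worked out directly. This is only a sketch: `pct_good` is a hypothetical helper, and it assumes that the good and bad MID counts together account for every read that was examined.

```shell
#!/bin/sh
# Counts copied from the parsed reads report above.
good=259091392
bad=39457049

# pct_good (hypothetical helper): percent of reads with a good MID,
# assuming good + bad is the total number of reads examined.
pct_good() {
    awk -v g="$1" -v b="$2" 'BEGIN { printf "%.2f\n", 100 * g / (g + b) }'
}

pct_good "$good" "$bad"
```

Under that assumption, a bit under 87% of the reads carried a good MID.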
I used bwa to index the reference genome:
bwa index -a bwtsw Potrs01-genome.fa
I then used bwa to align the sequences for each individual to the reference. Here is one example (note the readgroup header should work with GATK):
cd /labs/evolution/data/aspen/gbs/Assemblies/
bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -Y -f alnUSF_3142_D.sai Potrs01-genome.fa /labs/evolution/data/aspen/gbs/Parsed/USF_3142_D.fastq
bwa samse -n 1 -r '@RG\tID:USF_3142_D\tPL:ILLUMINA\tLB:USF_3142_D\tSM:USF_3142_D' -f alnUSF_3142_D.sam Potrs01-genome.fa alnUSF_3142_D.sai /labs/evolution/data/aspen/gbs/Parsed/USF_3142_D.fastq
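The example above covers a single individual. A rough sketch of how the same two commands could be generated for every parsed fastq file follows. It only prints the commands so they can be reviewed or piped to a shell; `print_bwa_cmds` is a hypothetical helper, and the sketch assumes each parsed file is named `<sample ID>.fastq`, as in the example.

```shell
#!/bin/sh
# Paths taken from the example above; override via the environment if needed.
PARSED=${PARSED:-/labs/evolution/data/aspen/gbs/Parsed}
REF=${REF:-Potrs01-genome.fa}

# print_bwa_cmds (hypothetical helper): print the aln and samse commands for one fastq file.
print_bwa_cmds() {
    fq=$1
    id=$(basename "$fq" .fastq)     # sample ID, e.g. USF_3142_D
    # Keep a literal backslash-t in the readgroup string; bwa expands it itself.
    rg="@RG\\tID:${id}\\tPL:ILLUMINA\\tLB:${id}\\tSM:${id}"
    printf '%s\n' "bwa aln -n 5 -l 20 -k 2 -t 8 -q 10 -Y -f aln${id}.sai ${REF} ${fq}"
    printf '%s\n' "bwa samse -n 1 -r '${rg}' -f aln${id}.sam ${REF} aln${id}.sai ${fq}"
}

for fq in "$PARSED"/*.fastq; do
    [ -e "$fq" ] || continue        # nothing to do if the glob matched no files
    print_bwa_cmds "$fq"
done
```

Printing rather than executing keeps the sketch safe to run anywhere; saving the output to a file gives a job list that can be split across nodes.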
Next I used samtools to compress, sort and index the alignments (example):
cd /home/A01963476/data/aspen/gbs/Assemblies/
samtools view -b -S -o /labs/evolution/data/aspen/gbs/Assemblies/alnUSF_4701_D.bam /labs/evolution/data/aspen/gbs/Assemblies/alnUSF_4701_D.sam
samtools sort /labs/evolution/data/aspen/gbs/Assemblies/alnUSF_4701_D.bam /labs/evolution/data/aspen/gbs/Assemblies/alnUSF_4701_D.sorted
samtools index /labs/evolution/data/aspen/gbs/Assemblies/alnUSF_4701_D.sorted.bam
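The same three steps can be generated for every alignment in the directory. The sketch below prints the commands rather than running them; `print_samtools_cmds` is a hypothetical helper, and it assumes the sam files follow the `aln<sample ID>.sam` naming used above. It keeps the older `samtools sort <in.bam> <out.prefix>` form from the example; newer samtools releases expect `samtools sort -o <out.bam> <in.bam>` instead.

```shell
#!/bin/sh
# Directory taken from the example above; override via the environment if needed.
ASM=${ASM:-/labs/evolution/data/aspen/gbs/Assemblies}

# print_samtools_cmds (hypothetical helper): print the view, sort and index commands for one sam file.
print_samtools_cmds() {
    sam=$1
    base=${sam%.sam}                # strip the .sam extension
    printf '%s\n' "samtools view -b -S -o ${base}.bam ${sam}"
    printf '%s\n' "samtools sort ${base}.bam ${base}.sorted"
    printf '%s\n' "samtools index ${base}.sorted.bam"
}

for sam in "$ASM"/aln*.sam; do
    [ -e "$sam" ] || continue       # nothing to do if the glob matched no files
    print_samtools_cmds "$sam"
done
```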