Post date: Jun 07, 2016 10:19:4 PM
We had a second genome assembly done by Dovetail, which included a two Chicago libraries. The new report is here (Dovetail genome report). The genome assembly is here: king:/uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta. This one rocks.
My plan is to re-designate LGs with this genome based on the old mapping families. Here is what I did.
1. index the new genome with bwa (using 0.7.10-r789).
bwa index -a is timema_06Jun2016_RvNkF.fasta
2. Align mapping family data from second run (second round of sequencing) to the reference genome with bwa.
I set-up a new directory to run the Dovetail genome alignments. Symbolic links to all of the parsed reads from the second round of sequencing (from NCGR) are in king:/uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/MAP*fastq. From within that directory I am using bwa aln and samse to align the data to the Dovetail genome. Here was the call:
perl ../../scripts/wrap_qsub_slurm_bwa.pl MAP25*
And an example command:
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnMAP25300.sai ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta MAP25300.fastq
bwa samse -n 1 -r '@RG ID:timema-MAP25300' -f alnMAP25300.sam ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta alnMAP25300.sai MAP25300.fastq
3. Compressed, sorted and indexed the alignment:
perl ../../scripts/wrap_qsub_slurm_sam2bam.pl alnMAP25*sam
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/Alignments/
samtools view -b -S -o alnMAP25300.bam alnMAP25300.sam
samtools sort alnMAP25300.bam alnMAP25300.sorted
samtools index alnMAP25300.sorted.bam
4. The fastq files for the first sequencing run (SeqWright, I think) are in /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/timema_linkmap/. I used bwa to align the sequences to the reference and samtools to compress, index and sort the alignments as described above.
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/timema_linkmap/Parsed/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnxMAP25300.sai ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta MAP25300.fastq
bwa samse -n 1 -r '@RG ID:timema-MAP25300' -f alnxMAP25300.sam ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta alnxMAP25300.sai MAP25300.fastq
5. I moved all of the alignments (directly or via symbolic links) to the mapdata subdirectory. I then merged the alignments with samtools merge. Here is an example:
perl ../../scripts/wrap_qsub_slurm_merge.pl alnxMAP25*bam
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/Alignments/
samtools merge -n -h alnxMAP25300.sorted.bam mergeMAP25300.sorted.bam alnxMAP25300.sorted.bam alnMAP25300.sorted.bam
samtools sort mergeMAP25300.sorted.bam mergeMAP25300.sorted.sorted
samtools index mergeMAP25300.sorted.sorted.bam
6. Next I sorted the alignments by family for variant calling. I ran a separate jobvar*sh script for each family . Note that I am using newer versions of samtools (1.2) and bcftools (1.3) and that these are rather different. One key note here is that I am requiring a minimum mean epth of 4. Here is the example command (I might want to further refine this... there are standard workflows for whole genomes, but not for GBS):
module load samtools
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/fam25266by25267/
samtools faidx ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta
samtools mpileup -f ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta -q 10 -Q 15 -D -g -u -I merg*sorted.sorted.bam > fam25266x25267.bcf
~/bin/bcftools call -o fam25266x25267.vcf -O v -f GQ -V indels -v -c -p 0.01 -P 1e-3 fam25266x25267.bcf
~/bin/vcfutils.pl varFilter -d 464 -w 5 -e 1e-10 fam25266x25267.vcf > filter_fam25266x25267.vcf
I ended up with 60,683 SNPs for family A (25266 x 26267)
7. Estimate recombination rates, first perl scripts are used to parse the data, and estimate genotypes:
## FAM A
perl ../../scripts/getParentGentoypes.pl 114 115 filter_fam25266x25267.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt
#Found 12594 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 114 115 sub_parentGenotypes.txt filter_fam25266x25267.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt
## FAM B
perl ../../scripts/getParentGentoypes.pl 24 25 filter_fam25268x25269.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt
#Found 5052 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 24 25 sub_parentGenotypes.txt filter_fam25268x25269.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt
## FAM C
perl ../../scripts/getParentGentoypes.pl 19 20 filter_fam25270x25271.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt
#Found 2903 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 19 20 sub_parentGenotypes.txt filter_fam25270x25271.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt