Second dovetail assembly

Post date: Jun 07, 2016 10:19:4 PM

We had a second genome assembly done by Dovetail, which included a two Chicago libraries. The new report is here (Dovetail genome report). The genome assembly is here: king:/uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta. This one rocks.

My plan is to re-designate LGs with this genome based on the old mapping families. Here is what I did.

1. index the new genome with bwa (using 0.7.10-r789).

bwa index -a is timema_06Jun2016_RvNkF.fasta

2. Align mapping family data from second run (second round of sequencing) to the reference genome with bwa.

I set-up a new directory to run the Dovetail genome alignments. Symbolic links to all of the parsed reads from the second round of sequencing (from NCGR) are in king:/uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/MAP*fastq. From within that directory I am using bwa aln and samse to align the data to the Dovetail genome. Here was the call:

perl ../../scripts/wrap_qsub_slurm_bwa.pl MAP25*

And an example command:

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/

bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnMAP25300.sai ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta MAP25300.fastq

bwa samse -n 1 -r '@RG ID:timema-MAP25300' -f alnMAP25300.sam ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta alnMAP25300.sai MAP25300.fastq

3. Compressed, sorted and indexed the alignment:

perl ../../scripts/wrap_qsub_slurm_sam2bam.pl alnMAP25*sam

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/Alignments/

samtools view -b -S -o alnMAP25300.bam alnMAP25300.sam

samtools sort alnMAP25300.bam alnMAP25300.sorted

samtools index alnMAP25300.sorted.bam

4. The fastq files for the first sequencing run (SeqWright, I think) are in /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/timema_linkmap/. I used bwa to align the sequences to the reference and samtools to compress, index and sort the alignments as described above.

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/timema_linkmap/Parsed/

bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnxMAP25300.sai ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta MAP25300.fastq

bwa samse -n 1 -r '@RG ID:timema-MAP25300' -f alnxMAP25300.sam ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta alnxMAP25300.sai MAP25300.fastq

5. I moved all of the alignments (directly or via symbolic links) to the mapdata subdirectory. I then merged the alignments with samtools merge. Here is an example:

perl ../../scripts/wrap_qsub_slurm_merge.pl alnxMAP25*bam

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/Alignments/

samtools merge -n -h alnxMAP25300.sorted.bam mergeMAP25300.sorted.bam alnxMAP25300.sorted.bam alnMAP25300.sorted.bam

samtools sort mergeMAP25300.sorted.bam mergeMAP25300.sorted.sorted

samtools index mergeMAP25300.sorted.sorted.bam

6. Next I sorted the alignments by family for variant calling. I ran a separate jobvar*sh script for each family . Note that I am using newer versions of samtools (1.2) and bcftools (1.3) and that these are rather different. One key note here is that I am requiring a minimum mean epth of 4. Here is the example command (I might want to further refine this... there are standard workflows for whole genomes, but not for GBS):

module load samtools

cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/fam25266by25267/

samtools faidx ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta

samtools mpileup -f ../../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta -q 10 -Q 15 -D -g -u -I merg*sorted.sorted.bam > fam25266x25267.bcf

~/bin/bcftools call -o fam25266x25267.vcf -O v -f GQ -V indels -v -c -p 0.01 -P 1e-3 fam25266x25267.bcf

~/bin/vcfutils.pl varFilter -d 464 -w 5 -e 1e-10 fam25266x25267.vcf > filter_fam25266x25267.vcf

I ended up with 60,683 SNPs for family A (25266 x 26267)

7. Estimate recombination rates, first perl scripts are used to parse the data, and estimate genotypes:

## FAM A

perl ../../scripts/getParentGentoypes.pl 114 115 filter_fam25266x25267.vcf

perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt

#Found 12594 recombination informative loci

perl ../../scripts/calculateOffspringGenotypeProbs.pl 114 115 sub_parentGenotypes.txt filter_fam25266x25267.vcf

perl ../../scripts/splitP0P1.pl offspringGenotypes.txt

## FAM B

perl ../../scripts/getParentGentoypes.pl 24 25 filter_fam25268x25269.vcf

perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt

#Found 5052 recombination informative loci

perl ../../scripts/calculateOffspringGenotypeProbs.pl 24 25 sub_parentGenotypes.txt filter_fam25268x25269.vcf

perl ../../scripts/splitP0P1.pl offspringGenotypes.txt

## FAM C

perl ../../scripts/getParentGentoypes.pl 19 20 filter_fam25270x25271.vcf

perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt

#Found 2903 recombination informative loci

perl ../../scripts/calculateOffspringGenotypeProbs.pl 19 20 sub_parentGenotypes.txt filter_fam25270x25271.vcf

perl ../../scripts/splitP0P1.pl offspringGenotypes.txt

Page updated

Google Sites

Report abuse