Post date: Jan 24, 2017 6:32:32 PM
I want to compare order and orientation of LG8 scaffolds for the green vs. dark morph. As part of this, I plan to align the mapping family data to the green morph (the de novo one) genome. Here is what I am doing.
1. index the new genome with bwa (using 0.7.10-r789)
bwa index -a is timema_green_22Dec2016_ycYxT.fasta
2. Align the mapping family sequences to the green morph reference genome. There are two runs of sequence data. The first set of commands is for the NCGR data (second run), whereas the second set is for older SeqWright data (see Second dovetail assembly). The alignment files differ in the numer (1 or 2) after the alnGx.
perl ../../scripts/wrap_qsub_slurm_bwa.pl MAP25*fastq
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/mapdata/parsed/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnGx1MAP25300.sai ../../../tcrDovetail/timema_green_denovo/timema_green_22Dec2016_ycYxT/timema_green_22Dec2016_ycYxT.fasta MAP25300.fastq
bwa samse -n 1 -r '@RG ID:timema-MAP25300' -f alnGx1MAP25300.sam ../../../tcrDovetail/timema_green_denovo/timema_green_22Dec2016_ycYxT/timema_green_22Dec2016_ycYxT.fasta alnGx1MAP25300.sai MAP25300.fastq
perl ../../scripts/wrap_qsub_slurm_bwa.pl MAP25*fastq
cd /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/timema_mappingfams/timema_linkmap/Parsed/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnGx2MAP25300.sai ../../../tcrDovetail/timema_green_denovo/timema_green_22Dec2016_ycYxT/timema_green_22Dec2016_ycYxT.fasta MAP25300.fastq
bwa samse -n 1 -r '@RG ID:timema-MAP25300' -f alnGx2MAP25300.sam ../../../tcrDovetail/timema_green_denovo/timema_green_22Dec2016_ycYxT/timema_green_22Dec2016_ycYxT.fasta alnGx2MAP25300.sai MAP25300.fastq
3. Compress, sort, and index the alignments. If first moved all 384 sam files (2 per each of the 196 individuals) to /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_mappingfams/mapdata/AlignmentsGreen/.
perl ../../scripts/wrap_qsub_slurm_sam2bam.pl alnGx*sam
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_mappingfams/mapdata/AlignmentsGreen/
samtools view -b -S -o alnGx2MAP25300.bam alnGx2MAP25300.sam
samtools sort alnGx2MAP25300.bam alnGx2MAP25300.sorted
samtools index alnGx2MAP25300.sorted.bam
4. Merge the two alignment files for each individual.
perl ../../scripts/wrap_qsub_slurm_merge.pl alnGx1MAP25*bam
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_mappingfams/mapdata/AlignmentsGreen/
samtools merge -n -h alnGx1MAP25300.sorted.bam mergeMAP25300.sorted.bam alnGx1MAP25300.sorted.bam alnGx2MAP25300.sorted.bam
samtools sort mergeMAP25300.sorted.bam mergeMAP25300.sorted.sorted
samtools index mergeMAP25300.sorted.sorted.bam
5. Variant calling with samtools, there is one submission file per families and each family is being processed separately. These are jobvar[ABC].sh in /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_mappingfams/mapdata. Here is the script for family A, the largest family with 116 individuals total (thus -d 464 = 4x coverage).
#!/bin/sh
#SBATCH --time=72:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --account=gompert
#SBATCH --partition=kingspeak
#SBATCH --job-name=samtools
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
module load samtools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/timema_mappingfams/mapdata/green_fam25266by25267
samtools faidx ../../../tcrDovetail/timema_green_denovo/timema_green_22Dec2016_ycYxT/timema_green_22Dec2016_ycYxT.fasta
samtools mpileup -f ../../../tcrDovetail/timema_green_denovo/timema_green_22Dec2016_ycYxT/timema_green_22Dec2016_ycYxT.fasta -q 10 -Q 15 -D -g -u -I merg*sorted.sorted.bam > green_fam25266x25267.bcf
~/bin/bcftools call -o green_fam25266x25267.vcf -O v -f GQ -V indels -v -c -p 0.01 -P 1e-3 green_fam25266x25267.bcf
~/bin/vcfutils.pl varFilter -d 464 -w 5 -e 1e-10 green_fam25266x25267.vcf > filter_green_fam25266x25267.vcf
We ended up with 63,639 variants for family A, this is similar to what we got aligning to the dark morph genome.
6. Estimate recombination rates, first perl scripts are used to parse the data, and estimate genotypes:
## FAM A
perl ../../scripts/getParentGentoypes.pl 114 115 filter_green_fam25266x25267.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt
#Found 12409 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 114 115 sub_parentGenotypes.txt filter_green_fam25266x25267.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt
## FAM B
perl ../../scripts/getParentGentoypes.pl 24 25 filter_green_fam25268x25269.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt
#Found 5006 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 24 25 sub_parentGenotypes.txt filter_green_fam25268x25269.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt
## FAM C
perl ../../scripts/getParentGentoypes.pl 19 20 filter_green_fam25270x25271.vcf
perl ../../scripts/thresholdParentGenotypes.pl 0.9 parentGenotypes.txt
#Found 3048 recombination informative loci
perl ../../scripts/calculateOffspringGenotypeProbs.pl 19 20 sub_parentGenotypes.txt filter_green_fam25270x25271.vcf
perl ../../scripts/splitP0P1.pl offspringGenotypes.txt