Post date: Sep 02, 2016 9:51:54 PM
My plan is to call variants with the sperm donor and sperm. SNPs heterozygous in the sperm donor and with data for many sperm (presumably a small number of SNPs) will be used to look at recombination rates (by LG) and to quantify segregation distortion. SNPs homozygous in the donor but with a novel allele in individual sperm with high coverage data will be used to estimate mutation rates (or rates of PCR errors??).
Here is what I did through variant calling:
1. Convert the sam files to bam files (sperm and sperm donor only):
command = perl ../Scripts/wrap_qsub_slurm_sam2bam.pl aln_*sperm*sam aln_*tcrSpr*sam
example =
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/SpermDonorPlus/Alignments/
samtools view -b -S -o aln_195105_tcrSpr712517-bISZjrew.bam aln_195105_tcrSpr712517-bISZjrew.sam
samtools sort aln_195105_tcrSpr712517-bISZjrew.bam aln_195105_tcrSpr712517-bISZjrew.sorted
samtools index aln_195105_tcrSpr712517-bISZjrew.sorted.bam
2. Remove PCR duplicates. This seems reasonable for the high-coverage sperm donor genome, but I need to think about this for individual sperm. And this is never appropriate for the GBS data. I used a wrapper perl script for this.
perl makrDups.pl *sperm*sorted.bam
#!/usr/bin/perl
#
foreach $bam (@ARGV){
print "dedupping $bam\n";
$out = $bam;
$out =~ s/sorted/dedup/ or die "failed sub here: $out\n";
$err = $out;
$err =~ s/bam/log.txt/ or die "failed sub here: $err\n";
system "java -Xmx480g -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar MarkDuplicates INPUT=$bam OUTPUT=$out METRICS_FILE=$err\n";
}
3. Create dict file for genome (required for GATK) and index bam files with duplicates removed.
perl index.pl aln_*dedup*bam
module load picard
java -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar CreateSequenceDictionary R=../../tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta O=../../tcrDovetail/version2/timema_06Jun2016_RvNkF.dict
4. Variant calling with HaplotypeCaller (sperm donor only).
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/SpermDonorPlus/Alignments/
java -Xmx920g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta -I donorbams.list -o donor.vcf -gt_mode DISCOVERY -hets 0.001 -mbq 30 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE
5. First round of variant calling for individual sperm (all variants, will compare to donor), note that HaplotypeCaller does not support multiple threads. Instead, there is a new strategy proposed by the Broad Institute folks to running the HaplotypeCaller on each sample independently, produce an intermediate *.g.vcf file (this includes genotype likelihoods), and then to combine samples for variant calling afterwards. That is what I am trying. Here is the wrapper perl script and example (not that each infile is a file that list the four bam files for a sperm sample).
perl ../Scripts/wrap_qsub_slurm_gatk.pl bamstcrSpr7*
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/SpermDonorPlus/Alignments/
java -Xmx48g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta -I /uufs/chpc.utah.edu/common/home/u6000989/data/timema/SpermDonorPlus/Alignments//bamstcrSpr712517.list -o /uufs/chpc.utah.edu/common/home/u6000989/data/timema/SpermDonorPlus/Variants//tcrSpr712517.g.vcf -gt_mode DISCOVERY -hets 0.001 -mbq 30 -out_mode EMIT_VARIANTS_ONLY -ploidy 1 -stand_call_conf 50 -pcrModel AGGRESSIVE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
6. Joint genotyping across sperm samples (not donor).
sbatch spermVarCall.sh
#!/bin/sh
#SBATCH --time=200:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
echo ------------------------------------------------------
echo SLURM: job identifier is $SLURM_JOBID
echo SLURM: job name is $SLURM_JOB_NAME
echo ------------------------------------------------------
cd /uufs/chpc.utah.edu/common/home/u6000989/data/timema/SpermDonorPlus/Variants/
java -Xmx460g -jar ~/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version2/timema_06Jun2016_RvNkF.fasta --variant tcrSpr701502.g.vcf --variant tcrSpr701503.g.vcf --variant tcrSpr701504.g.vcf --variant tcrSpr701505.g.vcf --variant tcrSpr701506.g.vcf --variant tcrSpr701507.g.vcf --variant tcrSpr701508.g.vcf --variant tcrSpr701517.g.vcf --variant tcrSpr702502.g.vcf --variant tcrSpr702503.g.vcf --variant tcrSpr702504.g.vcf --variant tcrSpr702505.g.vcf --variant tcrSpr702506.g.vcf --variant tcrSpr702507.g.vcf --variant tcrSpr702508.g.vcf --variant tcrSpr702517.g.vcf --variant tcrSpr703502.g.vcf --variant tcrSpr703503.g.vcf --variant tcrSpr703504.g.vcf --variant tcrSpr703505.g.vcf --variant tcrSpr703506.g.vcf --variant tcrSpr703507.g.vcf --variant tcrSpr703508.g.vcf --variant tcrSpr703517.g.vcf --variant tcrSpr704502.g.vcf --variant tcrSpr704503.g.vcf --variant tcrSpr704504.g.vcf --variant tcrSpr704505.g.vcf --variant tcrSpr704506.g.vcf --variant tcrSpr704507.g.vcf --variant tcrSpr704508.g.vcf --variant tcrSpr704517.g.vcf --variant tcrSpr705502.g.vcf --variant tcrSpr705503.g.vcf --variant tcrSpr705504.g.vcf --variant tcrSpr705505.g.vcf --variant tcrSpr705506.g.vcf --variant tcrSpr705507.g.vcf --variant tcrSpr705508.g.vcf --variant tcrSpr705517.g.vcf --variant tcrSpr706502.g.vcf --variant tcrSpr706503.g.vcf --variant tcrSpr706504.g.vcf --variant tcrSpr706505.g.vcf --variant tcrSpr706506.g.vcf --variant tcrSpr706507.g.vcf --variant tcrSpr706508.g.vcf --variant tcrSpr706517.g.vcf --variant tcrSpr707502.g.vcf --variant tcrSpr707503.g.vcf --variant tcrSpr707504.g.vcf --variant tcrSpr707505.g.vcf --variant tcrSpr707506.g.vcf --variant tcrSpr707507.g.vcf --variant tcrSpr707508.g.vcf --variant tcrSpr707517.g.vcf --variant tcrSpr708502.g.vcf --variant tcrSpr708503.g.vcf --variant tcrSpr708504.g.vcf --variant tcrSpr708505.g.vcf --variant tcrSpr708506.g.vcf --variant tcrSpr708507.g.vcf --variant tcrSpr708508.g.vcf --variant tcrSpr708517.g.vcf --variant tcrSpr709502.g.vcf --variant tcrSpr709503.g.vcf --variant tcrSpr709504.g.vcf --variant tcrSpr709505.g.vcf --variant tcrSpr709506.g.vcf --variant tcrSpr709507.g.vcf --variant tcrSpr709508.g.vcf --variant tcrSpr709517.g.vcf --variant tcrSpr710502.g.vcf --variant tcrSpr710503.g.vcf --variant tcrSpr710504.g.vcf --variant tcrSpr710505.g.vcf --variant tcrSpr710506.g.vcf --variant tcrSpr710507.g.vcf --variant tcrSpr710508.g.vcf --variant tcrSpr710517.g.vcf --variant tcrSpr711502.g.vcf --variant tcrSpr711503.g.vcf --variant tcrSpr711504.g.vcf --variant tcrSpr711505.g.vcf --variant tcrSpr711506.g.vcf --variant tcrSpr711507.g.vcf --variant tcrSpr711508.g.vcf --variant tcrSpr711517.g.vcf --variant tcrSpr712502.g.vcf --variant tcrSpr712503.g.vcf --variant tcrSpr712504.g.vcf --variant tcrSpr712505.g.vcf --variant tcrSpr712506.g.vcf --variant tcrSpr712507.g.vcf --variant tcrSpr712508.g.vcf --variant tcrSpr712517.g.vcf -ploidy 1 -o spermvariants.vcf