Post date: Jun 03, 2018 3:1:4 AM
This covers the 768 individuals sequenced and analyzed previously and and additional 768 individuals sequenced in 2018. The data are in /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/.
1. Use bwa (0.7.17-r1188) to index the new Dovetail version of the L. melissa genome.
bwa index melissa_blue_21Nov2017_GLtS4.fasta
2. Use bwa aln and samse (0.7.17-r1188)) to align the first data set (pre-2018) to the new genome.
This used a perl wrapper script run from the Parsed subdirectory to submit 768 alignment jobs.
perl ../Scripts/wrap_qsub_slurm_bwa.pl *fastq
Here are the commands for on individual.
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Parsed/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnUSL-15-50.sai /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/melissa_blue_21Nov2017_GLtS4.fasta USL-15-50.fastq
bwa samse -n 1 -r '@RG\tID:lyc-USL-15-50\tPL:ILLUMINA\tLB:lyc-USL-15-50\tSM:lyc-USL-15-50' -f alnUSL-15-50.sam /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/melissa_blue_21Nov2017_GLtS4.fasta alnUSL-15-50.sai USL-15-50.fastq
3. Align the new data to the reference genome in a similar fashion. This was done as described above (2) and included 768 additional alignments. Here is the perl command (from the Parsed_2018 subdirectory).
perl ../Scripts/wrap_qsub_slurm_bwa.pl *fastq
And here is an example individual alignment command.
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Parsed_2018/
bwa aln -n 4 -l 20 -k 2 -t 8 -q 10 -f alnUSL-17-59-M.sai /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/melissa_blue_21Nov2017_GLtS4.fasta USL-17-59-M.fastq
bwa samse -n 1 -r '@RG\tID:lyc-USL-17-59-M\tPL:ILLUMINA\tLB:lyc-USL-17-59-M\tSM:lyc-USL-17-59-M' -f alnUSL-17-59-M.sam /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/melissa_blue_21Nov2017_GLtS4.fasta alnUSL-17-59-M.sai USL-17-59-M.fastq
4. Next, all 1536 alignments (sam and sai files from 2 and 3) were moved to the Alignments_2018 subdirectory and compressed, sorted and indexed with samtools (version 1.5 using htslib 1.5). This was done with the Sam2Bam.sh script which executes the perl script Sam2BamFork.pl (from the Scripts subdirectory):
#!/usr/bin/perl
#
# conver sam to bam, then sort and index
#
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach $sam (@ARGV){
$pm->start and next FILES; ## fork
$sam =~ m/^([A-Za-z0-9_\-]+\.)sam/ or die "failed to match $fq\n";
$base = $1;
system "samtools view -b -O BAM -o $base"."bam $sam\n";
system "samtools sort -O BAM -o $base"."sorted.bam $base"."bam\n";
system "samtools index -b $base"."sorted.bam\n";
$pm->finish;
}
$pm->wait_all_children;
5. Fix format for reference genome and call variants with GATK's HaplotypeCaller (v3.5-0-g36282e4). Dovetail included '=' in the fasta headers for the reference genome, which caused GATK to fail. I fixed this by first fixing and re-indexing the genome (replacing '=' with '_'). The new genome is mod_melissa_blue_21Nov2017_GLtS4.fasta.
cat melissa_blue_21Nov2017_GLtS4.fasta | perl -p -i -e's/=/_/' > mod_melissa_blue_21Nov2017_GLtS4.fasta
## create dict/index files
bwa index mod_melissa_blue_21Nov2017_GLtS4.fasta
java -jar /uufs/chpc.utah.edu/sys/installdir/picard/2.1.1/picard.jar CreateSequenceDictionary R=mod_melissa_blue_21Nov2017_GLtS4.fasta O=mod_melissa_blue_21Nov2017_GLtS4.dict
samtools faidx ~/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta
I then converted the bam alignments back to sam, fixed the '=' by replacing with '_' and reconverted to mod*bam files. (run from the Alignments_2018 folder).
perl fixAllSams.pl *sorted.bam
Then I used GATK's haplotype caller to generate g.vcf files, which will be in the Variants_2018 subdirectory. This was run from the Alignments_2018 subdirectory and initially writes to scratch and spreads the load without fork over the cluster (this seems fast enough, should take a day or two).
perl ../Scripts/wrap_qsub_slurm_gatk.pl mod_*bam
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Alignments_2018
java -Xmx48g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta -I /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Alignments_2018/mod_alnUSL-17-59-M.sorted.bam -o /scratch/general/lustre/lycGatk/mod_alnUSL-17-59-M.sorted.g.vcf -gt_mode DISCOVERY -hets 0.001 -mbq 30 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000
cp /scratch/general/lustre/lycGatk/mod_alnUSL-17-59-M.sorted.g.vcf /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Variants_2018/mod_alnUSL-17-59-M.sorted.g.vcf
My plan had been to run joint variant calling by region (scaffold). But this might actually be a memory issue and GATK joint variant calling appears to happily support proper multithreading. I tried this as a single big job, but ran into an issue with the program having trouble establishing locks on idx files. I searched online and this likely was caused by having too many g.vcf files (this will be fixed in a later version of GATK). Instead, I used an option to first combine sets of g.vcf files (96 at a time) into combined g.vcf files:
perl runCombineGVCFs.pl
#!/usr/bin/perl
# fix scaffold names
use Parallel::ForkManager;
my $max = 10;
my $pm = Parallel::ForkManager->new($max);
open(IN,"jobs.txt") or die;
while(<IN>){
push(@jobs,$_);
}
FILES:
foreach $job (@jobs){
$pm->start and next FILES; ## fork
print "JOB=$job\n\n";
system "$job\n";
$pm->finish;
}
$pm->wait_all_children;
Jobs has 16 commands, each to combine 96 g.vcf files. Here is one of them.
java -Xmx80g -jar ~/bin/GenomeAnalysisTK.jar -T CombineGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta --variant mod_alnBCR-17-07-F.sorted.g.vcf --variant mod_alnBCR-17-08-F.sorted.g.vcf --variant mod_alnBCR-17-09-F.sorted.g.vcf --variant mod_alnBCR-17-10-F.sorted.g.vcf --variant mod_alnBCR-17-11-F.sorted.g.vcf --variant mod_alnBCR-17-12-F.sorted.g.vcf --variant mod_alnBCR-17-13-F.sorted.g.vcf --variant mod_alnBCR-17-15-M.sorted.g.vcf --variant mod_alnBCR-17-16-M.sorted.g.vcf --variant mod_alnBCR-17-17-M.sorted.g.vcf --variant mod_alnBCR-17-18-M.sorted.g.vcf --variant mod_alnBCR-17-19-M.sorted.g.vcf --variant mod_alnBCR-17-20-M.sorted.g.vcf --variant mod_alnBCR-17-21-M.sorted.g.vcf --variant mod_alnBCR-17-22-M.sorted.g.vcf --variant mod_alnBCR-17-23-M.sorted.g.vcf --variant mod_alnBCR-17-24-M.sorted.g.vcf --variant mod_alnBCR-17-25-M.sorted.g.vcf --variant mod_alnBCR-17-26-M.sorted.g.vcf --variant mod_alnBCR-17-27-M.sorted.g.vcf --variant mod_alnBCR-17-28-M.sorted.g.vcf --variant mod_alnBCR-17-29-M.sorted.g.vcf --variant mod_alnBCR-17-30-M.sorted.g.vcf --variant mod_alnBCR-17-31-M.sorted.g.vcf --variant mod_alnBCR-17-32-M.sorted.g.vcf --variant mod_alnBCR-17-33-M.sorted.g.vcf --variant mod_alnBCR-17-34-M.sorted.g.vcf --variant mod_alnBCR-17-35-M.sorted.g.vcf --variant mod_alnBCR-17-36-M.sorted.g.vcf --variant mod_alnBCR-17-37-M.sorted.g.vcf --variant mod_alnBCR-17-38-M.sorted.g.vcf --variant mod_alnBCR-17-39-M.sorted.g.vcf --variant mod_alnBCR-17-40-M.sorted.g.vcf --variant mod_alnBCR-17-41-M.sorted.g.vcf --variant mod_alnBCR-17-42-M.sorted.g.vcf --variant mod_alnBCR-17-43-M.sorted.g.vcf --variant mod_alnBCR-17-44-M.sorted.g.vcf --variant mod_alnBCR-17-45-M.sorted.g.vcf --variant mod_alnBCR-17-46-M.sorted.g.vcf --variant mod_alnBCR-17-47-M.sorted.g.vcf --variant mod_alnBCR-17-48-M.sorted.g.vcf --variant mod_alnBCR-17-49-M.sorted.g.vcf --variant mod_alnBCR-17-50-M.sorted.g.vcf --variant mod_alnBNP-13-01.sorted.g.vcf --variant mod_alnBNP-13-02.sorted.g.vcf --variant mod_alnBNP-13-03.sorted.g.vcf --variant mod_alnBNP-13-04.sorted.g.vcf --variant mod_alnBNP-13-05.sorted.g.vcf --variant mod_alnBNP-13-06.sorted.g.vcf --variant mod_alnBNP-13-07.sorted.g.vcf --variant mod_alnBNP-13-08.sorted.g.vcf --variant mod_alnBNP-13-09.sorted.g.vcf --variant mod_alnBNP-13-10.sorted.g.vcf --variant mod_alnBNP-13-11.sorted.g.vcf --variant mod_alnBNP-13-12.sorted.g.vcf --variant mod_alnBNP-13-13.sorted.g.vcf --variant mod_alnBNP-13-14.sorted.g.vcf --variant mod_alnBNP-13-15.sorted.g.vcf --variant mod_alnBNP-13-16.sorted.g.vcf --variant mod_alnBNP-13-17.sorted.g.vcf --variant mod_alnBNP-13-18.sorted.g.vcf --variant mod_alnBNP-13-19.sorted.g.vcf --variant mod_alnBNP-13-20.sorted.g.vcf --variant mod_alnBNP-13-21.sorted.g.vcf --variant mod_alnBNP-13-22.sorted.g.vcf --variant mod_alnBNP-13-23.sorted.g.vcf --variant mod_alnBNP-13-24.sorted.g.vcf --variant mod_alnBNP-13-25.sorted.g.vcf --variant mod_alnBNP-13-26.sorted.g.vcf --variant mod_alnBNP-13-27.sorted.g.vcf --variant mod_alnBNP-13-28.sorted.g.vcf --variant mod_alnBNP-13-29.sorted.g.vcf --variant mod_alnBNP-13-30.sorted.g.vcf --variant mod_alnBNP-13-31.sorted.g.vcf --variant mod_alnBNP-13-32.sorted.g.vcf --variant mod_alnBNP-13-33.sorted.g.vcf --variant mod_alnBNP-13-34.sorted.g.vcf --variant mod_alnBNP-13-35.sorted.g.vcf --variant mod_alnBNP-13-36.sorted.g.vcf --variant mod_alnBNP-13-37.sorted.g.vcf --variant mod_alnBNP-13-38.sorted.g.vcf --variant mod_alnBNP-13-39.sorted.g.vcf --variant mod_alnBNP-13-40.sorted.g.vcf --variant mod_alnBNP-13-41.sorted.g.vcf --variant mod_alnBNP-13-42.sorted.g.vcf --variant mod_alnBNP-13-43.sorted.g.vcf --variant mod_alnBNP-13-44.sorted.g.vcf --variant mod_alnBNP-13-45.sorted.g.vcf --variant mod_alnBNP-13-46.sorted.g.vcf --variant mod_alnBNP-13-47.sorted.g.vcf --variant mod_alnBNP-13-48.sorted.g.vcf --variant mod_alnBNP-13-49.sorted.g.vcf --variant mod_alnBNP-15-01.sorted.g.vcf --variant mod_alnBNP-15-02.sorted.g.vcf --variant mod_alnBNP-15-03.sorted.g.vcf --variant mod_alnBNP-15-04.sorted.g.vcf -o combinded_2.g.vcf
Next, I genotyped the combined file. This was actually fast, it took about 90 minutes.
sbatch joinCallVar.sh
#!/bin/sh
#SBATCH --time=200:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=gatk
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Variants_2018/
java -Xmx460g -jar ~/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/dovetail_melissa_genome/download/HiC_HiCRise_GLtS4/melissa_blue_21Nov2017_GLtS4/mod_melissa_blue_21Nov2017_GLtS4.fasta -hets 0.001 -nt 20 --variant combinded_10.g.vcf --variant combinded_11.g.vcf --variant combinded_12.g.vcf --variant combinded_13.g.vcf --variant combinded_14.g.vcf --variant combinded_15.g.vcf --variant combinded_16.g.vcf --variant combinded_1.g.vcf --variant combinded_2.g.vcf --variant combinded_3.g.vcf --variant combinded_4.g.vcf --variant combinded_5.g.vcf --variant combinded_6.g.vcf --variant combinded_7.g.vcf --variant combinded_8.g.vcf --variant combinded_9.g.vcf -o lycTimeSeriesVariants.vcf
This generated the file lycTimeSeriesVariants.vcf with 1,144,223 variants across 1536 individuals.