Post date: Nov 27, 2019 3:15:39 AM
We have 486 genomes (more files than that) from multiple Timema species. Alignment to the version 3 genome was described in the last post. Here, I am documenting everything else through variant calling.
1. Remove PCR duplicates. This was done with samtools (version 1.5) and working in /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_sp_wgs/files_aln/.
sbatch SubRmdup
which runs
perl RmdupFork.pl *bam
which contains
#!/usr/bin/perl
#
# removes pcr duplicates
#
use Parallel::ForkManager;
my $max = 24;
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach $bam (@ARGV){
$pm->start and next FILES; ## fork
$ubam = $bam;
$ubam =~ s/sorted/uniqe/ or die "failed sub $ubam\n";
system "samtools rmdup -S $bam $ubam\n";
system "samtools index -b $ubam\n";
$pm->finish;
}
$pm->wait_all_children;
Numbers of PCR duplicates are in slurm-7684414.out and appear to be higher (~20-30%) for the 84 genomes than the original genomes (~10%).
2. Merge alignments for each individual into a single bam file.
sbatch SubMerge.sh
which runs MergeFork.pl, which looks like this:
#!/usr/bin/perl
#
# merge bam files for each individual
#
use Parallel::ForkManager;
my $max = 30;
my $pm = Parallel::ForkManager->new($max);
open(IN, "ids.txt") or die "failed to read\n";
while(<IN>){
$pm->start and next; ## fork
chomp;
$id = $_;
$out = "merge_$id".".bam";
system "samtools merge -nr $out *$id*bam\n";
system "samtools sort -O BAM -o sorted_$out $out\n";
system "samtools index sorted_$out\n";
$pm->finish;
}
3. Create g.vcf files with HaplotypeCaller from GATK The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4.
From /uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_sp_wgs/files_aln, I ran
sbatch RunGatk.sh
which runs GatkFork.pl, which looks like this:
#!/usr/bin/perl
#
# make g.vcf files
#
use Parallel::ForkManager;
my $max = 30;
my $pm = Parallel::ForkManager->new($max);
my $idir = '/uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_sp_wgs/files_aln';
my $odir = '/uufs/chpc.utah.edu/common/home/gompert-group2/data/timema_sp_wgs/variants_2019';
my $tdir = '/scratch/general/nfs1/timema';
my $genome = "/uufs/chpc.utah.edu/common/home/u6000989/data/timema/tcrDovetail/version3/mod_map_timema_06Jun2016_RvNkF702.fasta";
FILES:
foreach $bam (@ARGV){
$pm->start and next FILES; ## fork
$out = $bam;
$out =~ s/bam/g.vcf/ or die "failed here: $out\n";
system "java -Xmx48g -jar ~/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R $genome -I $idir/$bam -o $tdir/$out -gt_mode DISCOVERY -hets 0.001 -mbq 20 -out_mode EMIT_VARIANTS_ONLY -ploidy 2 -stand_call_conf 50 -pcrModel AGGRESSIVE --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000\n";
system "cp $tdir/$out $odir/$out\n";
$pm->finish;
}
$pm->wait_all_children;
4.