Post date: Nov 23, 2019 3:25:23 AM
We noticed that output from GATK mutect2 was either corrupted (normal mode) or lacked variants (tumor only mode). We decided to update to the current version of GATK, 4.1.4.0, which is now in my bin directory (gatk-package-4.1.4.0-local.jar).
Unfortunately, running this version caused repeated errors. After playing around for awhile, it looks like the problem is that MuTect2 does not like apparent PCR duplicates (it also might not like secondary alignments). Of course, the problem with that is that with our GBS data, pretty much everything looks like a PCR duplicate. This means we can't just drop PCR duplicates.
To deal with this (at least for now), we are removing the first or last 1-5 bp from each read. Here is what I have done:
1. Remove 1-5 bp from the start or end of each read. Note that I am only running this on the 100 PoN individuals and the trees from the pando area. From /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Parsed (and on an interactive node... this is quick).
perl RunDrop.pl
which runs dropEnds.pl
which looks like this:
#!/usr/bin/perl
#
# drops the first or last 1-5 bases to avoid sequences looking like PCR duplicates
#
my $in = shift(@ARGV); ## fq file
my $out = "mod_$in";
open(IN, $in) or die "failed to read from $in\n";
open(OUT, "> $out") or die "failed to write to $out\n";
$offset = 1; ## how many to drop
$end = 0; ## boolean from right end?
while(<IN>){
print OUT $_; ## print header as is
$seq = <IN>;
chomp($seq);
if($end == 0){
$seq =~ s/\S{$offset}$//;
}
else{
$seq =~ s/^\S{$offset}//;
}
print OUT "$seq\n";
$hdr = <IN>;## 2nd header
print OUT $hdr;
$qual = <IN>;
chomp($qual);
if($end == 0){
$qual =~ s/\S{$offset}$//;
}
else{
$qual =~ s/^\S{$offset}//;
}
print OUT "$qual\n";
$offset++;
if($offset > 5){## switch end and reset offset count
$offset = 1;
$end = 1 - $end;
}
}
close(IN);
close(OUT);
2. Redo the alignments. Once again this is using bwa mem, but now I am only running this on the new mod*fastq files (396 total).
From the Parsed subdirectory:
perl ../Scripts/wrap_qsub_slurm_bwa_mem.pl mod_*fastq
Here is the command for one of the trees/files:
cd /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Parsed/
bwa mem -t 8 -k 20 -w 100 -r 1.3 -T 30 -R '@RG\tID:potr-mod_WYW-1649\tPL:ILLUMINA\tLB:potr-mod_WYW-1649\tSM:potr-mod_WYW-1649' /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa mod_WYW-1649.fastq > /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Alignments_mem/aln_mem_mod_WYW-1649.sam 2> /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Alignments_mem/log_mem_mod_WYW-1649.txt