Post date: Apr 05, 2019 8:24:39 PM
Before moving on, I want to compare the variant sets I get from GATK with those from samtools/bcftools. My general thought is to then run analyses with (a) the GATK variants, and (b) the GATK variants validated by samtools ('high-confidence variants'). For the latter, I want to run at least the initial analyses with genotype likelihoods from samtools/bcftools and GATK.
Variant data from samtools are in /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Variants_Samtools/. The script used for variant calling is SamVarCall.sh, and it looks like this:
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=sam_varcall
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
#SAMTOOLs mpileup version 1.5 options used:
#C = adjust mapping quality; recommended:50, disable:0 [0]
#d = max per-file depth; avoids excessive memory usage [250]
#f = faidx indexed reference sequence file
#q = skip alignments with mapQ smaller than INT [0]
#Q = skip bases with baseQ/BAQ smaller than INT [13]
#g = generate genotype likelihoods in BCF format
#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
#BCFTOOLs call version 1.6 options used
#v = output variant sites only
#c/m = the original calling method (-c, conflicts with -m) or the alternative model for multiallelic and rare-variant calling (-m, conflicts with -c)
#p = variant if P(ref|D)<FLOAT with -c [0.5]
#P = --prior <float> mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF (here it is 'v')
#o = write output to a file [standard output]
module load samtools
module load bcftools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Variants_Samtools
samtools mpileup -C 50 -d 250 -f melissa_blue_21Nov2017_GLtS4.fasta -q 30 -Q 20 -g -I -t DP,AD -u -b pops_bams.txt -o lyc_timeseries_samtbcft.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o lyc_timeseries_samtbcft.vcf lyc_timeseries_samtbcft.bcf
This generates the variant file lyc_timeseries_samtbcft.vcf. I filtered this with ../Scripts/vcfFilter.pl using the following options (a sketch of how such thresholds get applied follows the option block):
#### stringency variables, edit as desired
my $minCoverage = 3072; # minimum number of sequences; DP
my $minAltRds = 20; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 2.5; # maximum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 2; # maximum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2; # minimum ratio of variant confidence to non-reference read depth; QD
my $mq = 30; # minimum mapping quality; MQ
my $miss = 307; # maximum number of individuals with no data
########
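For orientation, the DP minimum of 3072 works out to 2x summed over the 1,536 individuals, which is presumably where the filtered2x prefix on the output files comes from. Below is a minimal R sketch of the same filtering logic, not the actual vcfFilter.pl: the tag names come from the comments above, the INFO parsing is illustrative, and tags missing from a SNP's INFO field simply come back as NA here (the real script presumably handles those cases explicitly).
## in R (sketch only)
vcf  <- readLines("lyc_timeseries_samtbcft.vcf")
hdr  <- grepl("^#", vcf)
dat  <- vcf[!hdr]
info <- sapply(strsplit(dat, "\t"), `[`, 8)     # INFO is the 8th VCF column
getTag <- function(info, tag){
    v <- sub(paste0(".*(^|;)", tag, "=([^;]+).*"), "\\2", info)
    suppressWarnings(as.numeric(v))             # NA if the tag is absent
}
keep <- getTag(info, "DP") >= 3072 &            # minimum total depth
        getTag(info, "AC") >= 20 &              # minimum alt-allele count
        getTag(info, "AF") < 1.0 &              # not fixed for the alt allele
        abs(getTag(info, "BaseQRankSum")) <= 3 &
        abs(getTag(info, "MQRankSum")) <= 2.5 &
        abs(getTag(info, "ReadPosRankSum")) <= 2 &
        getTag(info, "QD") >= 2 &
        getTag(info, "MQ") >= 30
writeLines(c(vcf[hdr], dat[which(keep)]),
           "sketch_filtered_lyc_timeseries_samtbcft.vcf")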
From this I retained 309,534 SNPs. The filtering appears to help, in the sense that the transition:transversion ratio increased after filtering (a quick check of the ratios follows the TSTV output below):
pre-filter
TSTV, transitions/transversions:
TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 605839 408055 1.48 605123 405275 1.49
post-filter
TSTV, transitions/transversions:
TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 191619 117915 1.63 191619 117915 1.63
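The TSTV blocks above look like bcftools stats output; as a quick arithmetic check, the reported ratios follow directly from the ts and tv counts:
## in R
605839 / 408055   # pre-filter,  ~1.48
191619 / 117915   # post-filter, ~1.63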
Next, I filtered based on excessive depth and on having too many SNPs near each other (see the sketch after the R code below):
grep ^Sc filtered2x_lyc_timeseries_samtbcft.vcf | perl -p -i -e 's/^.*AD\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:[0-9,]+/\1/g' > depth.txt
## in R
d<-matrix(scan("depth.txt",n=309534*1536,sep=" "),nrow=309534,ncol=1536,byrow=TRUE) # SNP x individual depth matrix
mns<-apply(d,1,sum) # total depth per SNP
mean(mns)+2*sd(mns) # depth cutoff: mean plus two standard deviations
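The value printed above is the depth ceiling. Applying it, together with a proximity filter, could look like the following sketch (continuing in R); the 100 bp window and the hypothetical snpPositions.txt file of scaffold/position pairs, one row per SNP in the same order as depth.txt, are assumptions for illustration, not what actually produced the morefilter* file.
## in R (sketch only)
cutoff    <- mean(mns) + 2 * sd(mns)       # total-depth ceiling per SNP
highDepth <- mns > cutoff
pos <- read.table("snpPositions.txt",      # hypothetical scaffold/position file
                  col.names = c("scaffold", "position"))
tooClose <- rep(FALSE, nrow(pos))
for(i in 2:nrow(pos)){
    if(pos$scaffold[i] == pos$scaffold[i - 1] &&
       (pos$position[i] - pos$position[i - 1]) < 100){
        tooClose[c(i - 1, i)] <- TRUE      # flag both SNPs in a tight pair
    }
}
sum(!(highDepth | tooClose))               # SNPs that would survive this sketch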
At this point the numbers of SNPs are as follows:
I then looked at how coverage/depth varied by sequencing run. The run effect wasn't as strong as I had originally thought, but filtering based on differences in depth may still make sense (a sketch of this comparison follows the depth-extraction commands below).
grep ^Sc filtered2x_lyc_timeseries_samtbcft.vcf | perl -p -i -e 's/^.*AD\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:[0-9,]+/\1/g' > depth.txt
grep ^Sc morefilter_filtered2x_lyc_timeseries_samtbcft.vcf | perl -p -i -e 's/^.*AD\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:[0-9,]+/\1/g' > more_depth.txt
and for the GATK variants (in Variants_2018)
grep ^Sc filtered2x_lycTimeSeriesVariants.vcf | perl -p -i -e 's/^.*PL\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:\S+/\1/g' | perl -p -i -e 's/\.\/\.\:\d+,\d+\://g' | perl -p -i -e 's/\S+[ATCG]\S*/0/g' > depth.txt
grep ^Sc morefilter_filtered2x_lycTimeSeriesVariants.vcf | perl -p -i -e 's/^.*PL\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:\S+/\1/g' | perl -p -i -e 's/\.\/\.\:\d+,\d+\://g' | perl -p -i -e 's/\S+[ATCG]\S*/0/g' > more_depth.txt
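Roughly, the by-run comparison amounts to summarizing mean depth per individual and grouping by run. The sketch below assumes a hypothetical runs.txt with one run label per individual, in the same order as pops_bams.txt; that file and the plotting choices are illustrative, not part of the actual analysis.
## in R (sketch only)
d        <- matrix(scan("depth.txt", n = 309534 * 1536, sep = " "),
                   nrow = 309534, ncol = 1536, byrow = TRUE)
indMeans <- apply(d, 2, mean)                     # mean depth per individual
runs     <- scan("runs.txt", what = "character")  # hypothetical run labels
boxplot(indMeans ~ as.factor(runs),
        xlab = "sequencing run", ylab = "mean depth per individual")
tapply(indMeans, runs, mean)                      # run-level averages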
Most SNPs show similar coverage levels in the two data sets (once you account for overall differences in coverage, which are not that pronounced). I ran CovAnalysis.R on the GATK and samtools variants. This outputs some plots plus a file, evenCoverageSnps.txt, that flags whether each SNP has coverage close to the expected value (50%, plus or minus the mean ratio) in the two data sets. I then wrote a new perl script to subset those SNPs.
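The gist of CovAnalysis.R is something like the sketch below; the file names, the restriction to SNPs shared by both call sets (in the same order), the 0.1 window, and the output format are assumptions for illustration, not the actual script.
## in R (sketch only)
dG   <- as.matrix(read.table("gatk_depth.txt"))   # depth matrices restricted to
dS   <- as.matrix(read.table("sam_depth.txt"))    # SNPs shared by both call sets
totG <- apply(dG, 1, sum)
totS <- apply(dS, 1, sum)
ratio <- totG / (totG + totS)       # ~0.5 expected when coverage is even
even  <- as.numeric(abs(ratio - mean(ratio)) < 0.1)  # 0.1 window is an assumption
write.table(even, file = "evenCoverageSnps.txt",     # one 0/1 flag per SNP (format assumed)
            row.names = FALSE, col.names = FALSE)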
I ran this on the GATK data set:
perl subsetEvenCov.pl
which created covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf with 30,033 SNPs (compare this to the count in the morefilter* file).
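subsetEvenCov.pl isn't reproduced here, but the subsetting step amounts to the following R sketch, assuming evenCoverageSnps.txt holds one 0/1 flag per SNP in the same order as the data lines of the input VCF:
## in R (sketch only)
vcf  <- readLines("morefilter_filtered2x_lycTimeSeriesVariants.vcf")
hdr  <- grepl("^#", vcf)
flag <- scan("evenCoverageSnps.txt")   # assumed: one 0/1 value per SNP
writeLines(c(vcf[hdr], vcf[!hdr][flag == 1]),
           "covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf")
sum(flag == 1)                         # should equal the 30,033 SNPs reported above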
I did the same thing for the samtools/bcftools variants, although the cut-off doesn't appear to capture things as well there. I won't worry about that for now, as my plan is to focus on the covsub* GATK SNPs, either alone or the subset that overlaps with the filter/morefilter samtools SNPs. I do, however, want to work with both sets of genotype likelihoods.
Thus, I used bcftools 1.6 to find overlapping sets of SNPs. Note that versions with GATK genotype likelihoods are in the Variants_2018 subdirectory, whereas versions with samtools/bcftools genotype likelihoods are in the Variants_Samtools subdirectory. Note also that the gz files were all compressed with bgzip (this is required).
bcftools isec -p vcf_isec_gatk/ -n=2 -w1 covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf.gz ../Variants_Samtools/filtered2x_lyc_timeseries_samtbcft.vcf.gz &
bcftools isec -p vcf_isec_sam/ -n=2 -w1 filtered2x_lyc_timeseries_samtbcft.vcf.gz ../Variants_2018/covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf.gz
These files (0000.vcf) contain 12,886 SNPs, which, along with the set of 30,033 SNPs in the GATK covsub* file, will be the basis for my initial analyses.
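As a rough cross-check on that count, the overlap can also be recomputed in R by intersecting scaffold:position keys from the two (uncompressed) VCFs; bcftools isec also compares alleles, so the numbers may differ slightly.
## in R (sketch only)
key <- function(f){
    v    <- readLines(f)
    v    <- v[!grepl("^#", v)]
    cols <- strsplit(v, "\t")
    paste(sapply(cols, `[`, 1), sapply(cols, `[`, 2), sep = ":")  # scaffold:position
}
gatk <- key("../Variants_2018/covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf")
sam  <- key("filtered2x_lyc_timeseries_samtbcft.vcf")
length(intersect(gatk, sam))   # should be close to the 12,886 from bcftools isec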