Post date: Apr 05, 2019 8:24:39 PM
Before moving on, I want to compare the variant sets I get from GATK with those from samtools/bcftools. My general thought is to then run analyses with (a) the GATK variants, and (b) the GATK variants validated by samtools ('high-confidence variants'). For the latter, I want to run at least the initial analyses with genotype likelihoods from samtools/bcftools and GATK.
Variant data from samtools are in /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Variants_Samtools/. The script used for variant calling is SamVarCall.sh, and it looks like this:
#!/bin/sh
#SBATCH --time=240:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=24
#SBATCH --account=gompert-kp
#SBATCH --partition=gompert-kp
#SBATCH --job-name=sam_varcall
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=zach.gompert@usu.edu
#SAMTOOLs mpileup version 1.5 options used:
#C = adjust mapping quality; recommended:50, disable:0 [0]
#d = max per-file depth; avoids excessive memory usage [250]
#f = faidx indexed reference sequence file
#q = skip alignments with mapQ smaller than INT [0]
#Q = skip bases with baseQ/BAQ smaller than INT [13]
#g = generate genotype likelihoods in BCF format
#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []
#BCFTOOLs call version 1.6 options used
#v = output variant sites only
#c/m = the original calling method (-c, conflicts with -m) or the alternative model for multiallelic and rare-variant calling (-m, conflicts with -c)
#p = variant if P(ref|D)<FLOAT with -c [0.5]
#P = --prior <float> mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]
#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF (here it is 'v')
#o = write output to a file [standard output]
module load samtools
module load bcftools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/timeseries/Variants_Samtools
samtools mpileup -C 50 -d 250 -f melissa_blue_21Nov2017_GLtS4.fasta -q 30 -Q 20 -g -I -t DP,AD -u -b pops_bams.txt -o lyc_timeseries_samtbcft.bcf
bcftools call -v -c -p 0.01 -P 0.001 -O v -o lyc_timeseries_samtbcft.vcf lyc_timeseries_samtbcft.bcf
This generates the variant file lyc_timeseries_samtbcft.vcf. I filtered this with ../Scripts/vcfFilter.pl using the following options (a sketch of how such thresholds get applied follows the option block):
#### stringency variables, edit as desired
my $minCoverage = 3072; # minimum number of sequences; DP
my $minAltRds = 20; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 2.5; # maximum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 2; # maximum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2; # minimum ratio of variant confidence to non-reference read depth; QD
my $mq = 30; # minimum mapping quality; MQ
my $miss = 307; # maximum number of individuals with no data
########
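For orientation, the DP minimum of 3072 works out to 2x summed over the 1,536 individuals, which is presumably where the filtered2x prefix on the output files comes from. Below is a minimal R sketch of the same filtering logic, not the actual vcfFilter.pl: the tag names come from the comments above, the INFO parsing is illustrative, and tags missing from a SNP's INFO field simply come back as NA here (the real script presumably handles those cases explicitly).
## in R (sketch only)
vcf  <- readLines("lyc_timeseries_samtbcft.vcf")
hdr  <- grepl("^#", vcf)
dat  <- vcf[!hdr]
info <- sapply(strsplit(dat, "\t"), `[`, 8)     # INFO is the 8th VCF column
getTag <- function(info, tag){
    v <- sub(paste0(".*(^|;)", tag, "=([^;]+).*"), "\\2", info)
    suppressWarnings(as.numeric(v))             # NA if the tag is absent
}
keep <- getTag(info, "DP") >= 3072 &            # minimum total depth
        getTag(info, "AC") >= 20 &              # minimum alt-allele count
        getTag(info, "AF") < 1.0 &              # not fixed for the alt allele
        abs(getTag(info, "BaseQRankSum")) <= 3 &
        abs(getTag(info, "MQRankSum")) <= 2.5 &
        abs(getTag(info, "ReadPosRankSum")) <= 2 &
        getTag(info, "QD") >= 2 &
        getTag(info, "MQ") >= 30
writeLines(c(vcf[hdr], dat[which(keep)]),
           "sketch_filtered_lyc_timeseries_samtbcft.vcf")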
From this I retained 309,534 SNPs. The filtering appears to help, in the sense that the transition:transversion ratio increased after filtering (a quick check of the ratios follows the TSTV output below):
pre-filter
TSTV, transitions/transversions:
TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 605839 408055 1.48 605123 405275 1.49
post-filter
TSTV, transitions/transversions:
TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 191619 117915 1.63 191619 117915 1.63
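The TSTV blocks above look like bcftools stats output; as a quick arithmetic check, the reported ratios follow directly from the ts and tv counts:
## in R
605839 / 408055   # pre-filter,  ~1.48
191619 / 117915   # post-filter, ~1.63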
Next, I filtered based on excessive depth and on having too many SNPs near each other (see the sketch after the R code below):
grep ^Sc filtered2x_lyc_timeseries_samtbcft.vcf | perl -p -i -e 's/^.*AD\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:[0-9,]+/\1/g' > depth.txt
## in R
d<-matrix(scan("depth.txt",n=309534*1536,sep=" "),nrow=309534,ncol=1536,byrow=TRUE) # SNP x individual depth matrix
mns<-apply(d,1,sum) # total depth per SNP
mean(mns)+2*sd(mns) # depth cutoff: mean plus two standard deviations
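The value printed above is the depth ceiling. Applying it, together with a proximity filter, could look like the following sketch (continuing in R); the 100 bp window and the hypothetical snpPositions.txt file of scaffold/position pairs, one row per SNP in the same order as depth.txt, are assumptions for illustration, not what actually produced the morefilter* file.
## in R (sketch only)
cutoff    <- mean(mns) + 2 * sd(mns)       # total-depth ceiling per SNP
highDepth <- mns > cutoff
pos <- read.table("snpPositions.txt",      # hypothetical scaffold/position file
                  col.names = c("scaffold", "position"))
tooClose <- rep(FALSE, nrow(pos))
for(i in 2:nrow(pos)){
    if(pos$scaffold[i] == pos$scaffold[i - 1] &&
       (pos$position[i] - pos$position[i - 1]) < 100){
        tooClose[c(i - 1, i)] <- TRUE      # flag both SNPs in a tight pair
    }
}
sum(!(highDepth | tooClose))               # SNPs that would survive this sketch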
At this point the numbers of SNPs are as follows:
I then looked at how coverage/depth varied by sequencing run. The run effect wasn't as strong as I had originally thought, but filtering based on differences in depth may still make sense (a sketch of this comparison follows the depth-extraction commands below).
grep ^Sc filtered2x_lyc_timeseries_samtbcft.vcf | perl -p -i -e 's/^.*AD\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:[0-9,]+/\1/g' > depth.txt
grep ^Sc morefilter_filtered2x_lyc_timeseries_samtbcft.vcf | perl -p -i -e 's/^.*AD\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:[0-9,]+/\1/g' > more_depth.txt
and for the GATK variants (in Variants_2018)
grep ^Sc filtered2x_lycTimeSeriesVariants.vcf | perl -p -i -e 's/^.*PL\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:\S+/\1/g' | perl -p -i -e 's/\.\/\.\:\d+,\d+\://g' | perl -p -i -e 's/\S+[ATCG]\S*/0/g' > depth.txt
grep ^Sc morefilter_filtered2x_lycTimeSeriesVariants.vcf | perl -p -i -e 's/^.*PL\s+//' | perl -p -i -e 's/\d\/\d\:[0-9,]+\:(\d+)\:\S+/\1/g' | perl -p -i -e 's/\.\/\.\:\d+,\d+\://g' | perl -p -i -e 's/\S+[ATCG]\S*/0/g' > more_depth.txt
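Roughly, the by-run comparison amounts to summarizing mean depth per individual and grouping by run. The sketch below assumes a hypothetical runs.txt with one run label per individual, in the same order as pops_bams.txt; that file and the plotting choices are illustrative, not part of the actual analysis.
## in R (sketch only)
d        <- matrix(scan("depth.txt", n = 309534 * 1536, sep = " "),
                   nrow = 309534, ncol = 1536, byrow = TRUE)
indMeans <- apply(d, 2, mean)                     # mean depth per individual
runs     <- scan("runs.txt", what = "character")  # hypothetical run labels
boxplot(indMeans ~ as.factor(runs),
        xlab = "sequencing run", ylab = "mean depth per individual")
tapply(indMeans, runs, mean)                      # run-level averages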
Most SNPs show similar coverage levels in the two data sets (once you account for overall differences in coverage, which are not that pronounced). I ran CovAnalysis.R on the GATK and samtools variants. This outputs some plots plus a file, evenCoverageSnps.txt, that flags whether each SNP has coverage close to the expected value (50%, plus or minus the mean ratio) in the two data sets. I then wrote a new perl script to subset those SNPs.
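The gist of CovAnalysis.R is something like the sketch below; the file names, the restriction to SNPs shared by both call sets (in the same order), the 0.1 window, and the output format are assumptions for illustration, not the actual script.
## in R (sketch only)
dG   <- as.matrix(read.table("gatk_depth.txt"))   # depth matrices restricted to
dS   <- as.matrix(read.table("sam_depth.txt"))    # SNPs shared by both call sets
totG <- apply(dG, 1, sum)
totS <- apply(dS, 1, sum)
ratio <- totG / (totG + totS)       # ~0.5 expected when coverage is even
even  <- as.numeric(abs(ratio - mean(ratio)) < 0.1)  # 0.1 window is an assumption
write.table(even, file = "evenCoverageSnps.txt",     # one 0/1 flag per SNP (format assumed)
            row.names = FALSE, col.names = FALSE)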
I ran this on the GATK data set:
perl subsetEvenCov.pl
which created covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf with 30,033 SNPs (compare this to the count in the morefilter* file).
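subsetEvenCov.pl isn't reproduced here, but the subsetting step amounts to the following R sketch, assuming evenCoverageSnps.txt holds one 0/1 flag per SNP in the same order as the data lines of the input VCF:
## in R (sketch only)
vcf  <- readLines("morefilter_filtered2x_lycTimeSeriesVariants.vcf")
hdr  <- grepl("^#", vcf)
flag <- scan("evenCoverageSnps.txt")   # assumed: one 0/1 value per SNP
writeLines(c(vcf[hdr], vcf[!hdr][flag == 1]),
           "covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf")
sum(flag == 1)                         # should equal the 30,033 SNPs reported above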
I did the same thing for the samtools/bcftools variants, although the cut-off doesn't appear to capture things as well there. I won't worry about that for now, as my plan is to focus on the covsub* GATK SNPs, either alone or the subset that overlaps with the filter/morefilter samtools SNPs. I do, however, want to work with both sets of genotype likelihoods.
Thus, I used bcftools 1.6 to find overlapping sets of SNPs. Note that versions with GATK genotype likelihoods are in the Variants_2018 subdirectory, whereas versions with samtools/bcftools genotype likelihoods are in the Variants_Samtools subdirectory. Note also that the gz files were all compressed with bgzip (this is required).
bcftools isec -p vcf_isec_gatk/ -n=2 -w1 covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf.gz ../Variants_Samtools/filtered2x_lyc_timeseries_samtbcft.vcf.gz &
bcftools isec -p vcf_isec_sam/ -n=2 -w1 filtered2x_lyc_timeseries_samtbcft.vcf.gz ../Variants_2018/covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf.gz
These files (0000.vcf) contain 12,886 SNPs, which, along with the set of 30,033 SNPs in the GATK covsub* file, will be the basis for my initial analyses.
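As a rough cross-check on that count, the overlap can also be recomputed in R by intersecting scaffold:position keys from the two (uncompressed) VCFs; bcftools isec also compares alleles, so the numbers may differ slightly.
## in R (sketch only)
key <- function(f){
    v    <- readLines(f)
    v    <- v[!grepl("^#", v)]
    cols <- strsplit(v, "\t")
    paste(sapply(cols, `[`, 1), sapply(cols, `[`, 2), sep = ":")  # scaffold:position
}
gatk <- key("../Variants_2018/covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf")
sam  <- key("filtered2x_lyc_timeseries_samtbcft.vcf")
length(intersect(gatk, sam))   # should be close to the 12,886 from bcftools isec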