Pando plus variant calling and filtering with samtools/bcftools multiallelic variant caller

Post date: Nov 05, 2019 3:7:35 AM

We plan to ultimately try the multiallelic variant caller from bcftools and mutect2 from GATK. The latter requires specifying a reference (non-tumor) tree, which I think we can better do after playing some with the results from bcftools.

The mem alignments were used for variant calling. These are in /uufs/chpc.utah.edu/common/home/gompert-group1/data/aspen/gbs_pando_plus/Alignments_mem/.

We ran variant calling on the 296 trees from Pando and the two nearby clones. These are in pando_bams.txt. We ran,

sbatch SamVarCall.sh

#!/bin/sh

#SBATCH --time=240:00:00

#SBATCH --nodes=1

#SBATCH --ntasks=24

#SBATCH --account=gompert-kp

#SBATCH --partition=gompert-kp

#SBATCH --job-name=sam_varcall

#SBATCH --mail-type=FAIL

#SBATCH --mail-user=zach.gompert@usu.edu

#SAMTOOLs mpileup version 1.5 options used:

#C = adjust mapping quality; recommended:50, disable:0 [0]

#d = max per-file depth; avoids excessive memory usage [250]

#f = faidx indexed reference sequence file

#q = skip alignments with mapQ smaller than INT [0]

#Q = skip bases with baseQ/BAQ smaller than INT [13]

#g = generate genotype likelihoods in BCF format

#t = --output-tags LIST optional tags to output:DP,AD,ADF,ADR,SP,INFO/AD,INFO/ADF,INFO/ADR []

#BCFTOOLs call version 1.6 options used

#v = output variant sites only

#c/m = he original calling method (conflicts with -m) or alternative model for multiallelic and rare-variant calling (conflicts with -c)

#p = variant if P(ref|D)<FLOAT with -c [0.5]

#P = --prior <float-o mutation rate (use bigger for greater sensitivity), use with -m [1.1e-3]

#O = output type: 'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' (here it is 'v')

#o = write output to a file [standard output]

module load samtools

module load bcftools

cd /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Alignments_mem

samtools mpileup -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/genome/Potrs01-genome.fa -q 30 -Q 20 -g -I -t DP,AD -u -b pando_bams.txt -o pando_variants.bcf

bcftools call -v -m -p 0.01 -P 0.001 -O v -o pando_variants.vcf pando_variants.bcf

The variants file is in /uufs/chpc.utah.edu/common/home/gompert-group1/data/aspen/gbs_pando_plus/Variants_mem_bcftools/. It is pando_variants.vcf.

We next filtered variants as documented in filterVars.sh:

#!/bin/sh

# filter variant file with bcftools and my own perl scripts

# -g, --SnpGap <int> filter SNPs within <int> base pairs of an indel

# -G, --IndelGap <int> filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass

## bcftools 1.6

#bcftools filter -g 3 -G 10 -o f1_pando_variants.vcf pando_variants.vcf

## vcftiler

#### stringency variables, edits as desired

#my $minCoverage = 592; # minimum number of sequences; DP, 592 = 2X, N = 296 trees

#my $minAltRds = 1; # minimum number of sequences with the alternative allele; AC

#my $notFixed = 1.0; # removes loci fixed for alt; AF

#my $bqrs = 0.01; # Mann-Whitney U test of Base Quality Bias; BQB

#my $mqrs = 0.01; # Mann-Whitney U test of Mapping Quality Bias; MQB

#my $rprs = 0.1; # Mann-Whitney U test of Read Position Bias; RPB

#my $mq = 30; # minimum mapping quality; MQ

#my $miss = 59; # maximum number of individuals with no data, 80% with data

perl scripts/vcfFilter.pl pando_variants.vcf

#Finished filtering pando_variants.vcf

#Retained 40847 variable loci

## generates filtered2x_pando_variants.vcf

grep ^Por filtered2x_pando_variants.vcf | cut -f 8 | perl -p -i -e 's/DP=(\d+)\S+/\1/' > filteredDepth.txt

## mean cov = 3471.314, sd = 3095.771, mean + 2 sd = 9662.857, filter at 9663 retains 95.9%

## bcftools filter again for excessive depth

bcftools filter -e 'DP>9663' -o filtered2xHiCov_pando_variants.vcf filtered2x_pando_variants.vcf

## retained 39164 in filtered2xHiCov_pando_variants.vcf

Note, for now we are dropping (filtering out) multallelic SNPs, but we will ultimately want to keep them. We are just getting rid of them to make initial analyses of population structure more straightforward.

Page updated

Google Sites

Report abuse