Post date: Apr 29, 2016 7:7:5 PM
I am using samtools (v 1.2) and bcftools (v 1.3) for variant calling. Note that the latter uses the new bcftools call module with the multiallelic-caller. Here is the content from callvar.sh:
module load samtools
cd /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/lycaeides_gbs/AssembliesHybAncestry/
samtools mpileup -b bams -C 50 -d 250 -f /uufs/chpc.utah.edu/common/home/u6000989/data/lycaeides/melissa_genome/final.assembly.fasta -q 30 -Q 20 -I -u -g -t DP,DPR -o varsLycHybAncestry.bcf
bcftools call -v -m -p 0.01 -P 0.001 -O v -o varsLycHybAncestry.vcf varsLycHybAncestry.bcf
This resulted in 1,295,243 SNVs.
I am filtering variants with my own vcfFilter.pl script. Here are the key parameters:
#### stringency variables, edits as desired
my $minCoverage = 1844; # minimum number of sequences; DP, 1844 = 2X
my $minAltRds = 10; # minimum number of sequences with the alternative allele; AC
my $notFixed = 1.0; # removes loci fixed for alt; AF
my $mq = 30; # minimum mapping quality; MQ
my $miss = 92; # maximum number of individuals with no data
my $minMaf = 5; # minimum MAF, as allele count ~ 0.005
my $ratDifDir = 0.01; ## maximum proportion of reads in reverse (relative to avg.) orientation
##### this set is for whole genome shotgun data
Beyond this, I also drop any INDELs (shouldn't be any) or multi-allelic SNPs. Note that 92 inds. with missing data is 10%. I could try 20% if too few SNPs are called. The results (filtered2x_varsLycHybAncestry.vcf) will be in /uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_hybanc/variants/. Filtering resulted in 85,471 SNPs.
I decided that additional filtering was warranted (beyond filtered2x...). I considered using mapping and base quality bias (MQB and BQB): these are p-values that test the null that mapping or base qualities are the same for reads supporting the ref. vs. alternative allele.However, this seems to be mostly driven by information content (another case where effect estimates would be so much more useful than stupid p-values). So, I will not use this. Instead I am adding two additional filters:
1. Remove very high coverage (putative repeat region) SNPs. Mean depth (across all inds.) was 11,790 reads (12.78X). I plant to drop all reads with greater than mean + 3 s.d. coverage, that is the 1.3% of SNPs with > 38,438.96 reads.
2. Drop all SNPs withing X base pairs of each other (where X is 4 in this case).
These are accomplished with filterSomeMore.pl. The resulting file is morefilter_filtered2x_varsLycHybAncestry.vcf and includes 39,318 SNPs. Of these 4448 have a MAF > 5% (common variants), 10,950 have MAF > 1% and < 5% (uncommon variants), and 23,920 have MAF < 1% (rare variants). These are based on the vcf output.