Post date: Sep 03, 2018 9:57:46 PM
Here are the analyses for the whole genome sequence data within the SVs. The steps for generating the variant file are described here. I created a symbolic link for this file to a projects directory, /uufs/chpc.utah.edu/common/home/u6000989/projects/timema_SV/popgen (the Variants sub-directory), as follows:
ln -s /uufs/chpc.utah.edu/common/home/gompert-group1/data/timema/combind_wgs_dovetailV3/variants_tcr/tcr_wgs_variants.vcf ./
1. Filtering the variant data with my own custom filters. I ran,
sbatch vcfFilter.sh
which runs
perl ../Scripts/vcfFilter.pl tcr_wgs_variants.vcf
which contains the following key filters (along with filters for INDELs and any SNPs not on LGs, which should have already been removed anyway).
#### stringency variables, edits as desired
my $minCoverage = 160; # minimum number of sequences; DP
my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 2.5; # maxmum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 2.5; # maxmum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2; # minimum ratio of variant confidenct to non reference read depth; QD
my $mq = 20; # minimum mapping quality; MQ
my $fish = 60; #Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
my $miss = 40;
This generates the file filtered1X_tcr_wgs_variants.vcf, which contains 3,297,450 SNPs, for downstream analyses.
2. Convert the vcf file to gl format and split by population. This was all done as an interactive job.
Convert to gl, not MAF filter (or 0.0)
perl ../Scripts/vcf2gl.pl 0.0 filtered1X_tcr_wgs_variants.vcf
This generated 3,213,501 SNPs for 158 individuals. Need to check why 2 missing.
NOTE: I noticed two individuals were missing. They were simply left out of the combine g.vcfs step. I went back and added them and redid the above. The new files have the _x bit. Now filtered1X_tcr_wgs_variants_x.vcf has 3,252,350 SNPs.
Here are the failed counts:
Retained 3252350 variable loci
failed 0 : 1317492
failed 1 : 6956181
failed 2 : 19595319
failed 3 : 105
failed 4 : 3571
failed 5 : 25578570
failed 6 : 46380
failed 7 : 2248419
failed 8 : 20747
failed 9 : 0
failed 10 : 34595757
failed 11 : 45398044
The related *gl file contains 3,252,350 SNPs and 160 individuals.