Post date: Oct 28, 2016 4:11:45 AM
Filter variants. First I linked LycWgVars.vcf to /uufs/chpc.utah.edu/common/home/u6000989/projects/lyc_hybanc/genome_variants/, where I am now working.
1. I then modified a local version of the vcfFilter.pl with the following options (the original file had 6,176,061 variants):
my $minCoverage = 30; # minimum number of sequences; DP
my $bqrs = 2.5; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 2.5; # maximum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 2.5; # maximum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2.5; # minimum ratio of variant confidenct to non reference read depth; QD
my $mq = 40; # minimum mapping quality; MQ
Note the lack of missing-ness filter (there are only five genomes). I played a bit with options with minimal effect.
perl vcfFilter.pl LycWgVars.vcf
Retained 2686796 variable loci
failed 0 = multi-allelic: 227490
failed 1 = INDEL: 1830005
failed 2 = DP: 18080
failed 3 = BQrankSum: 1386623
failed 4 = MQrankSum: 1658715
failed 5 = RPrankSum: 1435237
failed 6 = QD: 327399
failed 7 = MQ: 649750
The outfile is filtered6x_LycWgVars.vcf
2. A second filter was applied to remove SNPs with excessive depth (> mean + 3 s.d. = 1386) and those within 5 bp of each other.
perl filterSomeMore.pl filtered6x_LycWgVars.vcf
This left 1,999,219 SNPs in morefilter_filtered6x_LycWgVars.vcf.
3. Genotypes were then extracted from the file. Genotypes with at least 6 reads (~ 0.0156 prob of missing an allele) were retained, others were converted to NA.
perl getGenotypes.pl
LycWGSnpData.txt has the SNP information: scaffold, position, reference allele, alt. allele.
LycWGDat.txt has the genotype data: scaffold, position, genotype for all five taxa = L. anna, L. idas, L. melissa, Sierra Nevada, Warner Mts..