Post date: Nov 19, 2018 2:14:24 AM
Filtering ptr_variants.vcf in /uufs/chpc.utah.edu/common/home/u6000989/data/aspen/gbs_pando_plus/Variants/
Note, this is before determining ploidy and is part of the process required to generate the input files for gbs2ploidy.
Run sbatch vcfFilter.sh
which runs
#!/usr/bin/perl
use warnings;
use strict;
# this program filters a vcf file based on overall sequence coverage, number of non-reference reads, number of alleles, and reverse orientation reads
# usage vcfFilter.pl infile.vcf
#
# change the marked variables below to adjust settings
#
## 2016 samples, none that appear to be blanks
#### stringency variables, edits as desired
my $minCoverage = 4032; # minimum number of sequences; DP
my $bqrs = 3; # maximum absolute value of the base quality rank sum test; BaseQRankSum
my $mqrs = 2.5; # maxmum absolute value of the mapping quality rank sum test; MQRankSum
my $rprs = 2.5; # maxmum absolute value of the read position rank sum test; ReadPosRankSum
my $qd = 2; # minimum ratio of variant confidenct to non reference read depth; QD
my $mq = 20; # minimum mapping quality; MQ
my $fish = 60; #Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
my $miss = 404; ## data for 80%
##### this set is for whole genome shotgun data
my $d;
my $dp;
my $minacov;
my @line;
my @fail;
my $in = shift(@ARGV);
open (IN, $in) or die "Could not read the infile = $in\n";
$in =~ m/^([a-zA-Z_0-9]+\.vcf)$/ or die "Failed to match the variant file\n";
open (OUT, "> filtered2X_$1") or die "Could not write the outfile\n";
my $flag = 0;
my $cnt = 0;
foreach my $i (0..11){
$fail[$i]=0;
}
while (<IN>){
chomp;
$flag = 1;
if (m/^\#/){ ## header row, always write
$flag = 1;
}
elsif (m/^Potrs\d+/){ ## this is a sequence line, you migh need to edit this reg. expr.
$flag = 1;
if (m/[ACTGN]\,[ACTGN]/){ ## two alternative alleles identified
$flag = 0;
$fail[0]++;# > 2 alleles
}
@line = split(/\s+/,$_);
if(length($line[3]) > 1 or length($line[4]) > 1){
$flag = 0;
$fail[1]++;## indel
}
m/DP=(\d+)/ or die "Syntax error, DP not found\n";
$dp = $1;
if ($dp < $minCoverage){
$flag = 0;
$fail[2]++;## low coverae
}
if(m/BaseQRankSum=([0-9e\-\.\+]*)/){
if (abs($1) > $bqrs){
$flag = 0;
$fail[3]++;
}
}
if(m/MQRankSum=([0-9\-\.e\+]*)/){
if (abs($1) > $mqrs){
$flag = 0;
$fail[4]++;
}
}
if(m/ReadPosRankSum=([0-9\-\.e\+]*)/){
if (abs($1) < $rprs){
$flag = 0;
$fail[5]++;
}
}
if(m/QD=([0-9\.]+)/){
if ($1 < $qd){
$flag = 0;
$fail[6]++;
}
}
if(m/MQ=([0-9\.]+)/){
if ($1 < $mq){
$flag = 0;
$fail[7]++;
}
}
if(m/FS=([0-9e\-\.\+]*)/){
if ($1 > $fish){
$flag = 0;
$fail[8]++;
}
}
# if (m/MAPPOS-NA/){
# $flag = 0;
# $fail[9]++;
# }
$d = () = (m/\.\/\.:0,0:0/g);
if ($d >= $miss){
$flag = 0;
$fail[10]++;
}
if ($flag == 1){
$cnt++; ## this is a good SNV
}
else{
$fail[11]++;
}
}
else{
print "Warning, failed to match the chromosome or scaffold name regular expression for this line\n$_\n";
$flag = 0;
}
if ($flag == 1){
print OUT "$_\n";
}
}
close (IN);
close (OUT);
print "Finished filtering $in\nRetained $cnt variable loci\n";
foreach my $i (0..11){
print "failed $i : $fail[$i]\n";
}
I repeated filtering with 80 or 70% of individuals having data. The files are filtered2X_ptr_variants.vcf and filtered2X70p_ptr_variants.vcf, which contain 29,200 and 39,336 SNPs respectively. I am going to work with filtered2X70p_ptr_variants.vcf.