Post date: Oct 30, 2019 11:1:11 PM
Working in /uufs/chpc.utah.edu/common/home/gompert-group2/projects/tcuri_islands/variants
1. Variant filtering was done with a mixture of my vcfFilter.pl script and bcftools (1.6). Details are in filterVars.sh, which is pasted below. From here, I am working with
final_tcuri_wgs_variants.vcf which has 1,210,820 SNPs.
#!/bin/sh
# filter variant file with bcftools and my own perl scripts
# -g, --SnpGap <int> filter SNPs within <int> base pairs of an indel
# -G, --IndelGap <int> filter clusters of indels separated by <int> or fewer base pairs allowing only one to pass
## bcftools 1.6
bcftools filter -g 3 -G 10 -o f1_tcuri_wgs_variants.vcf tcuri_wgs_variants.vcf
## vcftiler
#### stringency variables, edits as desired
#my $minCoverage = 40; # minimum number of sequences; DP
#my $minCount = 4; ## minmum number of alternative alleles
#my $bqrs = 3; # minimum absolute value of the base quality rank sum test; BaseQRankSum
#my $mqrs = 3; # minimum absolute value of the mapping quality rank sum test; MQRankSum
#my $rprs = 3; # minimum absolute value of the read position rank sum test; ReadPosRankSum
##my $qd = 2; # minimum ratio of variant confidenct to non reference read depth; QD
#my $mq = 30; # minimum mapping quality; MQ
#my $fish = 50; #Phred-scaled p-value using Fisher’s Exact Test to detect strand bias (the variation being seen on only the forward or only the reverse strand) in the reads. More bias is indicative of false positive calls.
## also filters for indels, multi-allelic, maf < 0.001
perl scripts/vcfFilter.pl f1_tcuri_wgs_variants.vcf
#Finished filtering f1_tcuri_wgs_variants.vcf
#Retained 1251227 variable loci
#failed 0 : 215055
#failed 1 : 2492146
#failed 2 : 11892515
#failed 3 : 71
#failed 4 : 766
#failed 5 : 104
#failed 6 : 62706
#failed 7 : 5279251
#failed 8 : 3676
#failed 9 : 0
#failed 10 : 5220821
#failed 11 : 6966241
#failed 12 : 0
#failed 13 : 14527221
## mean cov = 208.2, sd = 457.8, mean + 2 sd = 1159.8, filter at 1160 retains 96.7%
## bcftools filter again for excessive depth
bcftools filter -e 'DP>1160' -o final_tcuri_wgs_variants.vcf filtered1X_f1_tcuri_wgs_variants.vcf
## retained 1210820 SNPs in final_tcuri_wgs_variants.vcf
2. Convert to gl format. I ran,
perl scripts/vcf2gl.pl final_tcuri_wgs_variants.vcf
to generate final_tcuri_wgs_variants.gl. This file contains 1,210,820 SNPs and 40 individuals.
3. Split by host population, CY vs. C
perl scripts/splitPops.pl final_tcuri_wgs_variants.gl
C: 20 1210820
CY: 20 1210820
4. Obtain maximum likelihood allele frequency estimates
estpEM -i tcuri_C.gl -o p_tcuri_C.txt -e 0.01 -m 50 -h 0
Reading data from tcuri_C.gl
Number of loci: 1210820
Number of individuals: 20
Using EM algorithm to estimate allele frequencies
Writing results to p_tcuri_C.txt
Runtime: 0 hr 0 min 25 sec
estpEM -i tcuri_CY.gl -o p_tcuri_CY.txt -e 0.01 -m 50 -h 0
Reading data from tcuri_CY.gl
Number of loci: 1210820
Number of individuals: 20
Using EM algorithm to estimate allele frequencies
Writing results to p_tcuri_CY.txt
Runtime: 0 hr 0 min 24 sec
Extract MLE
cut -f 3 -d " " p_tcuri_C.txt > mlp_tcuri_C.txt
cut -f 3 -d " " p_tcuri_CY.txt > mlp_tcuri_CY.txt
Correlation in allele freqs. between populations:
t = 1019.3, df = 1210800, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6786154 0.6805326
sample estimates:
cor
0.6795752