Post date: Apr 17, 2019 2:23:44 PM
I am processing 3 vcf files. One with 30,033 variants from GATK, one with the subset of those variants (12,886) also found by samtools/bcftools, and a third one with the same variants but with genotype likelihoods from samtools/bcftools not GATK. From here on out, this is in projects not data, i.e., /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Variants.
1. Convert vcf to gl (no MAF filter).
perl vcf2gl.pl 0.0 ~/data/lycaeides/timeseries/Variants_2018/covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf
creates lyc_timeseries_GATK_full.gl with
30033; number of individuals 1536
perl vcf2gl.pl 0.0 ~/data/lycaeides/timeseries/Variants_2018/vcf_isec_gatk/0000.vcf
creates lyc_timeseries_GATK_sub.gl with
Number of loci: 12886; number of individuals 1536
perl vcf2glSamt.pl 0.0 ~/data/lycaeides/timeseries/Variants_Samtools/vcf_isec_sam/0000.vcf
creates lyc_timeseries_SAM_sub.gl with
Number of loci: 12886; number of individuals 1536
2. Split gl files by population and year for allele frequency estimation.
perl splitPops.pl lyc_timeseries_GATK_full.gl
perl splitPops.pl lyc_timeseries_GATK_sub.gl
perl splitPops.pl lyc_timeseries_SAM_sub.gl
3. Estimate allele frequencies. Because I plan to use this for downstream timeseries comparisons, getting ML estimates seems most appropriate. Thus, I am going with estpEM.
From the AlleleFreqs subdirectory, I ran:
perl RunEstpEMFork.pl *gl
which runs
#!/usr/bin/perl
#
# run estpEM as an interactive job
#
use Parallel::ForkManager;
my $max = 20;
my $pm = Parallel::ForkManager->new($max);
FILES:
foreach $file (@ARGV){
$pm->start and next FILES; ## fork
$file =~ m/([A-Z]+\-\d+)\S+_([A-Z]+_[a-z]+)\.gl/ or die "failed to match $file\n";
$out = "p_$1"."_$2".".txt";
system "~/bin/estpEM -i $file -o $out -m 50 -e 0.001 -h 2\n";
$pm->finish;
}
$pm->wait_all_children;
4. Comparison of allele frequencies and heterozygosity in R.
I compared allele frequencies and correlations in allele frequencies across the populations, years and data sets. There are two scripts for this: AfreqPCA.R and popgen.R.
AfreqPCA.R does some initial formatting and examines the effect of data set. The main take home from this is that while the data sets are similar, their is a greater sequence batch effect for the GATK genotype likelihoods, and it is essentially entirely absent from the samtools/bcftools genotype likelihoods.
popgen.R calculates heterozygosity and minor allele frequency distributions. These are really similar across populations, years and data sets.