Population allele frequences for GATK and samtools/bcftools variants/gls

Post date: Apr 17, 2019 2:23:44 PM

I am processing 3 vcf files. One with 30,033 variants from GATK, one with the subset of those variants (12,886) also found by samtools/bcftools, and a third one with the same variants but with genotype likelihoods from samtools/bcftools not GATK. From here on out, this is in projects not data, i.e., /uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_diversity/Variants.

1. Convert vcf to gl (no MAF filter).

perl vcf2gl.pl 0.0 ~/data/lycaeides/timeseries/Variants_2018/covsub_morefilter_filtered2x_lycTimeSeriesVariants.vcf

creates lyc_timeseries_GATK_full.gl with

30033; number of individuals 1536

perl vcf2gl.pl 0.0 ~/data/lycaeides/timeseries/Variants_2018/vcf_isec_gatk/0000.vcf

creates lyc_timeseries_GATK_sub.gl with

Number of loci: 12886; number of individuals 1536

perl vcf2glSamt.pl 0.0 ~/data/lycaeides/timeseries/Variants_Samtools/vcf_isec_sam/0000.vcf

creates lyc_timeseries_SAM_sub.gl with

Number of loci: 12886; number of individuals 1536

2. Split gl files by population and year for allele frequency estimation.

perl splitPops.pl lyc_timeseries_GATK_full.gl

perl splitPops.pl lyc_timeseries_GATK_sub.gl

perl splitPops.pl lyc_timeseries_SAM_sub.gl

3. Estimate allele frequencies. Because I plan to use this for downstream timeseries comparisons, getting ML estimates seems most appropriate. Thus, I am going with estpEM.

From the AlleleFreqs subdirectory, I ran:

perl RunEstpEMFork.pl *gl

which runs

#!/usr/bin/perl

# run estpEM as an interactive job

use Parallel::ForkManager;

my $max = 20;

my $pm = Parallel::ForkManager->new($max);

FILES:

foreach $file (@ARGV){

$pm->start and next FILES; ## fork

$file =~ m/([A-Z]+\-\d+)\S+_([A-Z]+_[a-z]+)\.gl/ or die "failed to match $file\n";

$out = "p_$1"."_$2".".txt";

system "~/bin/estpEM -i $file -o $out -m 50 -e 0.001 -h 2\n";

$pm->finish;

}

$pm->wait_all_children;

4. Comparison of allele frequencies and heterozygosity in R.

I compared allele frequencies and correlations in allele frequencies across the populations, years and data sets. There are two scripts for this: AfreqPCA.R and popgen.R.

AfreqPCA.R does some initial formatting and examines the effect of data set. The main take home from this is that while the data sets are similar, their is a greater sequence batch effect for the GATK genotype likelihoods, and it is essentially entirely absent from the samtools/bcftools genotype likelihoods.

popgen.R calculates heterozygosity and minor allele frequency distributions. These are really similar across populations, years and data sets.

Page updated

Google Sites

Report abuse