allele frequency estimates

On 17th March 2017, I restarted the analysis for this project. This is the first step of the analysis which involves preparing the input file which I will use for all further analysis after BayPass and TreeMix.

FOLDER MAININPUTFILES

Folder bayPass: see below
popmodfiles : C++ files for the program popmod to calculate allele frequency estimates (see step 3)
runPopmodfork.pl = perl script to run popmod as an interactive job or on the cluster for all the populations at once
runPopmod.sh = bash script to run runPopmodfork.pl on the cluster

*************** BAYPASS ***************************

For bayPass analysis the files are in the following directory:

\projects\lmelissa_hostAdapt\maininputfiles\bayPass

The files here are as follows:

FOLDER BAYPASS:

filtered_varMelissaAll.vcf = this is the main vcf file I am using to create the genotype likelihood files. This file was also used by Zach for his analysis. This file contains 206,047 variants which are all the variants available. No filtering is done. Also this file just contains L. melissa populations. I will have to add L. anna populations when I do the Treemix analysis (see treeMix below)
LmelissaVariants.gl = gl file from the filtered_varMelissaAll.vcf file.
popindiv.txt = contains two columns: population and number of individuals in that population.
FOLDER: gl files : contains gl files for each L.melissa population included in this analysis.
1. FOLDER: popfreq_files : contains population allele frequency files for each L. melissa population after running the estEMpop for on gl files.
  1. melissa_popallelefreqs.txt is combined file and contains all the population
FOLDER: popallelecounts_files : contains files for each population containing allele counts at each SNP. This file was made from the pop allele frequency files. See below for more details.
splitPops.pl = perl script to split the gl file into individual populations
vcf2gl.pl = perl script used to convert the vcf file to gl (or genotype likelihood file).

For treeMix analysis the files are in the following directory (will add more details as I prepare files for this analysis)

\projects\lmelissa_hostAdapt\maininputfiles\treeMix

For both these analysis, I used the following steps to prepare the final input file from the vcf file. This analysis includes 26 L. melissa populations. The final input files will be used for each of the analysis. The only difference in these files is the presence of 2 L. anna populations for TreeMix analysis as these populations are included as outgroup for this analysis.

STEP 1

Call : perl vcf2gl filtered_varMelissaAll.vcf

Use the vcf2gl. pl to convert the filtered_varMelissaAll.vcf file to LmelissaVariants.gl file. This perl script uses filters to change the file to gl file. There were two filters used : a). remove all fixed SNPs such that these SNPs were fixed for the alleles (19 SNVs in this case) b). include the biallelic snps in this file.

Number of SNVs present in the files:

filtered_varMelissaAll.vcf = 206,047

LmelissaVariants.gl = 206,028 (removed 19 fixed snvs, see above).

These gl files are saved in the following directory: /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/maininputfile/bayPass/gl_files

STEP 2

Call : perl splitPops.pl LmelissaVariants.gl

The splitPops.pl script is used to split the gl file into individual populations. These files are names as : Lmpop_ABM.gl etc.

STEP 3

Estimated population allele frequencies using Zach's estpEM program. This program only gives the population allele frequencies for each population. This program uses maximum likelihood allele frequency estimates by taking the gl files as input and outputting a text file with estimates.

Here is the usage and options for this program:

Usage: popmod -i infile.txt [options]

-i Infile with genetic data for the population

-o Outfile for allele frequency estimates [default = out.txt]

-e Tolerance for EM convergence [default = 0.001]

-m Maximum number of EM iterations [default = 20]

-h Number of header lines [default = 0]

Here is the example command I use (I modified this for each population):

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpEM -i Lmpop_WAL.gl -o p_WAL.txt -h 2

The output file has 3 columns:

Scaffold:position MLE MLE-foranalysis

0:33945 0.0658 0.0000

0:33977 0.0296 0.0000

0:33997 0.1872 0.1667

0:34052 0.0677 0.0000

0:50620 0.1239 0.0002

0:50681 0.1528 0.0709

0:68856 0.1586 0.0005

These files are saved in the following directory : /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/maininputfile/bayPass/popfreq_files

An alternate way to do this is using a bayesian model to estimate both genotype and population allele frequencies using popmod. Since this program was taking forever to run we used the estpEM model above. Below I have the details for popmod in case I need it again in the future.

Estimate allele frequencies and genotypes using popmod which is Zach's C++ program.

Here are the notes from Zach about the program and what it is doing:

[I wrote a C++ program to estimate genotypes and population allele frequencies from NGS data using pre-calculated genotype likelihoods. The program is called popmod and is in cvs under this name. This program samples the posterior probability distribution from the following Bayesian model (this is the unnormalized posterior),

Pr(g, p | x, e) = Pr(x | g, e) Pr(g | p) Pr(p)

Here the genotype likelihoods given by Pr(x | g, e) are pre-calculated using samtools and bcftools. The genotype prior is the HW model, and Pr(p) is given by a beta distribution with fixed hyperprior (the default is Jeffery's prior, a = b = 0.5). Thus, all steps in the MCMC algorithm are Gibb's samples from the full conditional posterior distributions. I ran the model on a small example data set and it produced reasonable output.

The output is written to a HDF5 file. I still need to write a version of estpost to parse this file.]

For this I used the runPopmodfork.pl script and ran the popmod program on each population to generate hdf5 files. This script uses popmod program to calculate genotypes and population allele frequencies and has the following usage:

Usage: popmod -i infile.txt [options]

-i Infile with genetic data for the population

-n Number of MCMC steps for the analysis [default = 10000]

-b Discard the first n MCMC samples as a burn-in [default = 1000]

-t Thin MCMC samples by recording every nth value [default = 2]

-o HDF5 format outfile with .hdf5 suffix [default = mcmcout.hdf5]

-a Parameter for beta prior on allele frequencies [default = 0.5]

I used the following options and ran 15000 MCMC steps and burn-in 500 with 4 chains. Here is an example command prompt:

popmod -i Lmpop*.gl -n 15000 -b 500 -t 4 -o pop*.hdf5

STEP 4

Preparing the final input file from all the population allele frequency files above. To run bayPass and treemix I need a file with allele counts at each SNP for each population. To do this I calculated allele frequency counts from the population allele frequencies we calculated in the previous step for each population.

R code for this: This is an example for population ABC which has 19 individuals. The allele counts for a population with n individuals (n =19 for ABC) are calculated as:

allele 1 = allele frequency * 2n

allele 2 = 2n - allele 1

Here allele frequency is the population allele frequency calculated in the previous step for each population.

#Repeat for each population by changing n (number of individuals in the population)

#create a function to get allele counts and round off the counts

allcount<- function(x){

counta <- x * 92

countaa <- round(counta,0)

}

#apply this function to the allele freq column of each population vector in R

p.abc = sapply(abc[,3],allcount)

#get the second allele count and combine both

abc.counts = cbind(p.abc, 38-p.abc)

#finally combining all the counts to get the final input file

final = merge(abc.counts, abm.counts,bhp.counts,bsd.counts,cdy.counts,ckv.counts,dbq.counts,> r.counts,gla.counts,gvl.counts,lan.counts,lca.counts,mon.counts,mtu.counts,ocy.counts,rew.co> ts,scc.counts,sla.counts,suv.counts,svy.counts,tpt.counts,ual.counts,vcp.counts,vic.counts,w> .counts,ywp.counts,fcr.counts,ybg.counts)

colnames(final1) =c("abc","abm","bhp","bsd","cdy","ckv","dbq","dcr","gla","gvl","lan","lca","mon","mtu","ocy","rew","scc","sla","suv","svy","tpt","ual","vcp","vic","wal","ywp","fcr","ybg")

**************************************************************************

************************** TREEMIX ****************************************

For treeMix analysis the files are in the following directory:

\projects\lmelissa_hostAdapt\maininputfiles\treeMix

The files here are as follows:

FOLDER treeMix:

melissaVariants2.vcf = this is the main vcf file I am using to create the genotype likelihood files. This file contains 206,047 variants which are all the variants available. No filtering is done. Also this file contains L. melissa populations and L. anna populations for outgroup for the Treemix analysis.
LmelissaVariants_tm.gl = gl file from the melissaVariants2.vcf file.
popindiv.txt = contains two columns: population and number of individuals in that population. This file has more variants removed since the addition of L. anna individuals. Therefore, after filtering (same approach as baypass) this file has 184,111 variants.
FOLDER: gl files : contains gl files for each L.melissa population included in this analysis.
1. FOLDER: popfreq_files : contains population allele frequency files for each L. melissa population after running the estEMpop for on gl files.
  1. melissa_popallelefreqs.txt is combined file and contains all the population
FOLDER: popallelecounts_files : contains files for each population containing allele counts at each SNP. This file was made from the pop allele frequency files. See below for more details.
splitPops.pl = perl script to split the gl file into individual populations
vcf2gl.pl = perl script used to convert the vcf file to gl (or genotype likelihood file).

STEP 1

Call : perl vcf2gl melissaVariants2.vcf

Number of SNVs present in the files:

melissaVariants2.vcf = 206,047

LmelissaVariants_tm.gl = Number of loci: 184110; number of individuals 565

(removed a lot of fixed snvs by addition of L. anna individuals, see above). This file does not add the number of loci in the first row of the gl file. We have to add this number manually otherwise we will get an error in the next step.

STEP 2

Call : perl splitPops.pl LmelissaVariants_tm.gl

The splitPops.pl script is used to split the gl file into individual populations. These files are names as : Lmpop_ABM.gl etc. There are 28 populations (26 L. melissa and 2 L. anna).

These gl files are saved in the following directory: /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/maininputfile/treeMix/gl_files

STEP 3

Here is the usage and options for this program:

Usage: popmod -i infile.txt [options]

-i Infile with genetic data for the population

-o Outfile for allele frequency estimates [default = out.txt]

-e Tolerance for EM convergence [default = 0.001]

-m Maximum number of EM iterations [default = 20]

-h Number of header lines [default = 0]

Here is the example command I use (I modified this for each population):

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpEM -i Lmpop_WAL.gl -o p_WAL.txt -h 2

The output file has 3 columns:

Scaffold:position MLE MLE-foranalysis

0:33945 0.0658 0.0000

0:33977 0.0296 0.0000

0:33997 0.1872 0.1667

0:34052 0.0677 0.0000

0:50620 0.1239 0.0002

0:50681 0.1528 0.0709

0:68856 0.1586 0.0005

These files are saved in the following directory : /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/maininputfile/bayPass/popfreq_files

Estimate allele frequencies and genotypes using popmod which is Zach's C++ program.

Here are the notes from Zach about the program and what it is doing:

Pr(g, p | x, e) = Pr(x | g, e) Pr(g | p) Pr(p)

The output is written to a HDF5 file. I still need to write a version of estpost to parse this file.]

Usage: popmod -i infile.txt [options]

-i Infile with genetic data for the population

-n Number of MCMC steps for the analysis [default = 10000]

-b Discard the first n MCMC samples as a burn-in [default = 1000]

-t Thin MCMC samples by recording every nth value [default = 2]

-o HDF5 format outfile with .hdf5 suffix [default = mcmcout.hdf5]

-a Parameter for beta prior on allele frequencies [default = 0.5]

I used the following options and ran 15000 MCMC steps and burn-in 500 with 4 chains. Here is an example command prompt:

popmod -i Lmpop*.gl -n 15000 -b 500 -t 4 -o pop*.hdf5

STEP 4

R code for this: This is an example for population ABC which has 19 individuals. The allele counts for a population with n individuals (n =19 for ABC) are calculated as:

allele 1 = allele frequency * 2n

allele 2 = 2n - allele 1

Here allele frequency is the population allele frequency calculated in the previous step for each population.

#Repeat for each population by changing n (number of individuals in the population)

abm = read.table("p_ABM.txt", header=FALSE)

#create a function to get allele counts and round off the counts

allcount<- function(x){

counta <- x * 92

countaa <- round(counta,0)

}

#apply this function to the allele freq column of each population vector in R

p.abm = sapply(abm[,3],allcount)

#get the second allele count and combine both

abm.counts = cbind(p.abm, 92-p.abm)

write.table(p.abm, file = "popcounts_ABM.txt", sep =",", row.names=FALSE, col.names=FALSE, quote=FALSE)

ABC 19

ABM 46

BHP 20

BSD 20

CDY 23

CKV 10

DBQ 20

DCR 20

GLA 20

GVL 18

LAN 24

LCA 20

MON 20

MTU 19

OCY 19

REW 20

SCC 16

SLA 18

SUV 20

SVY 20

TPT 13

UAL 20

VCP 20

VIC 20

WAL 20

YWP 20

source the Lmelissa bam files

/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_gbs/AssembliesHost

source the L anna bam files

/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_gbs/AssembliesAnna

source bcf tools

~/../u6000989/bin/bcftools call

Page updated

Google Sites

Report abuse