On 17th March 2017, I restarted the analysis for this project. This is the first step of the analysis which involves preparing the input file which I will use for all further analysis after BayPass and TreeMix.
FOLDER MAININPUTFILES
Folder bayPass: see below
popmodfiles : C++ files for the program popmod to calculate allele frequency estimates (see step 3)
runPopmodfork.pl = perl script to run popmod as an interactive job or on the cluster for all the populations at once
runPopmod.sh = bash script to run runPopmodfork.pl on the cluster
*************** BAYPASS ***************************
For bayPass analysis the files are in the following directory:
\projects\lmelissa_hostAdapt\maininputfiles\bayPass
The files here are as follows:
FOLDER BAYPASS:
filtered_varMelissaAll.vcf = this is the main vcf file I am using to create the genotype likelihood files. This file was also used by Zach for his analysis. This file contains 206,047 variants which are all the variants available. No filtering is done. Also this file just contains L. melissa populations. I will have to add L. anna populations when I do the Treemix analysis (see treeMix below)
LmelissaVariants.gl = gl file from the filtered_varMelissaAll.vcf file.
popindiv.txt = contains two columns: population and number of individuals in that population.
FOLDER: gl files : contains gl files for each L.melissa population included in this analysis.
FOLDER: popfreq_files : contains population allele frequency files for each L. melissa population after running the estEMpop for on gl files.
melissa_popallelefreqs.txt is combined file and contains all the population
FOLDER: popallelecounts_files : contains files for each population containing allele counts at each SNP. This file was made from the pop allele frequency files. See below for more details.
splitPops.pl = perl script to split the gl file into individual populations
vcf2gl.pl = perl script used to convert the vcf file to gl (or genotype likelihood file).
For treeMix analysis the files are in the following directory (will add more details as I prepare files for this analysis)
\projects\lmelissa_hostAdapt\maininputfiles\treeMix
For both these analysis, I used the following steps to prepare the final input file from the vcf file. This analysis includes 26 L. melissa populations. The final input files will be used for each of the analysis. The only difference in these files is the presence of 2 L. anna populations for TreeMix analysis as these populations are included as outgroup for this analysis.
STEP 1
Call : perl vcf2gl filtered_varMelissaAll.vcf
Use the vcf2gl. pl to convert the filtered_varMelissaAll.vcf file to LmelissaVariants.gl file. This perl script uses filters to change the file to gl file. There were two filters used : a). remove all fixed SNPs such that these SNPs were fixed for the alleles (19 SNVs in this case) b). include the biallelic snps in this file.
Number of SNVs present in the files:
filtered_varMelissaAll.vcf = 206,047
LmelissaVariants.gl = 206,028 (removed 19 fixed snvs, see above).
These gl files are saved in the following directory: /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/maininputfile/bayPass/gl_files
STEP 2
Call : perl splitPops.pl LmelissaVariants.gl
The splitPops.pl script is used to split the gl file into individual populations. These files are names as : Lmpop_ABM.gl etc.
STEP 3
Estimated population allele frequencies using Zach's estpEM program. This program only gives the population allele frequencies for each population. This program uses maximum likelihood allele frequency estimates by taking the gl files as input and outputting a text file with estimates.
Here is the usage and options for this program:
Usage: popmod -i infile.txt [options]
-i Infile with genetic data for the population
-o Outfile for allele frequency estimates [default = out.txt]
-e Tolerance for EM convergence [default = 0.001]
-m Maximum number of EM iterations [default = 20]
-h Number of header lines [default = 0]
Here is the example command I use (I modified this for each population):
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpEM -i Lmpop_WAL.gl -o p_WAL.txt -h 2
The output file has 3 columns:
Scaffold:position MLE MLE-foranalysis
0:33945 0.0658 0.0000
0:33977 0.0296 0.0000
0:33997 0.1872 0.1667
0:34052 0.0677 0.0000
0:50620 0.1239 0.0002
0:50681 0.1528 0.0709
0:68856 0.1586 0.0005
These files are saved in the following directory : /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/maininputfile/bayPass/popfreq_files
An alternate way to do this is using a bayesian model to estimate both genotype and population allele frequencies using popmod. Since this program was taking forever to run we used the estpEM model above. Below I have the details for popmod in case I need it again in the future.
Estimate allele frequencies and genotypes using popmod which is Zach's C++ program.
Here are the notes from Zach about the program and what it is doing:
[I wrote a C++ program to estimate genotypes and population allele frequencies from NGS data using pre-calculated genotype likelihoods. The program is called popmod and is in cvs under this name. This program samples the posterior probability distribution from the following Bayesian model (this is the unnormalized posterior),
Pr(g, p | x, e) = Pr(x | g, e) Pr(g | p) Pr(p)
Here the genotype likelihoods given by Pr(x | g, e) are pre-calculated using samtools and bcftools. The genotype prior is the HW model, and Pr(p) is given by a beta distribution with fixed hyperprior (the default is Jeffery's prior, a = b = 0.5). Thus, all steps in the MCMC algorithm are Gibb's samples from the full conditional posterior distributions. I ran the model on a small example data set and it produced reasonable output.
The output is written to a HDF5 file. I still need to write a version of estpost to parse this file.]
For this I used the runPopmodfork.pl script and ran the popmod program on each population to generate hdf5 files. This script uses popmod program to calculate genotypes and population allele frequencies and has the following usage:
Usage: popmod -i infile.txt [options]
-i Infile with genetic data for the population
-n Number of MCMC steps for the analysis [default = 10000]
-b Discard the first n MCMC samples as a burn-in [default = 1000]
-t Thin MCMC samples by recording every nth value [default = 2]
-o HDF5 format outfile with .hdf5 suffix [default = mcmcout.hdf5]
-a Parameter for beta prior on allele frequencies [default = 0.5]
I used the following options and ran 15000 MCMC steps and burn-in 500 with 4 chains. Here is an example command prompt:
popmod -i Lmpop*.gl -n 15000 -b 500 -t 4 -o pop*.hdf5
STEP 4
Preparing the final input file from all the population allele frequency files above. To run bayPass and treemix I need a file with allele counts at each SNP for each population. To do this I calculated allele frequency counts from the population allele frequencies we calculated in the previous step for each population.
R code for this: This is an example for population ABC which has 19 individuals. The allele counts for a population with n individuals (n =19 for ABC) are calculated as:
allele 1 = allele frequency * 2n
allele 2 = 2n - allele 1
Here allele frequency is the population allele frequency calculated in the previous step for each population.
#Repeat for each population by changing n (number of individuals in the population)
#create a function to get allele counts and round off the counts
allcount<- function(x){
counta <- x * 92
countaa <- round(counta,0)
}
#apply this function to the allele freq column of each population vector in R
p.abc = sapply(abc[,3],allcount)
#get the second allele count and combine both
abc.counts = cbind(p.abc, 38-p.abc)
#finally combining all the counts to get the final input file
final = merge(abc.counts, abm.counts,bhp.counts,bsd.counts,cdy.counts,ckv.counts,dbq.counts,> r.counts,gla.counts,gvl.counts,lan.counts,lca.counts,mon.counts,mtu.counts,ocy.counts,rew.co> ts,scc.counts,sla.counts,suv.counts,svy.counts,tpt.counts,ual.counts,vcp.counts,vic.counts,w> .counts,ywp.counts,fcr.counts,ybg.counts)
colnames(final1) =c("abc","abm","bhp","bsd","cdy","ckv","dbq","dcr","gla","gvl","lan","lca","mon","mtu","ocy","rew","scc","sla","suv","svy","tpt","ual","vcp","vic","wal","ywp","fcr","ybg")
**************************************************************************
************************** TREEMIX ****************************************
For treeMix analysis the files are in the following directory:
\projects\lmelissa_hostAdapt\maininputfiles\treeMix
The files here are as follows:
FOLDER treeMix:
melissaVariants2.vcf = this is the main vcf file I am using to create the genotype likelihood files. This file contains 206,047 variants which are all the variants available. No filtering is done. Also this file contains L. melissa populations and L. anna populations for outgroup for the Treemix analysis.
LmelissaVariants_tm.gl = gl file from the melissaVariants2.vcf file.
popindiv.txt = contains two columns: population and number of individuals in that population. This file has more variants removed since the addition of L. anna individuals. Therefore, after filtering (same approach as baypass) this file has 184,111 variants.
FOLDER: gl files : contains gl files for each L.melissa population included in this analysis.
FOLDER: popfreq_files : contains population allele frequency files for each L. melissa population after running the estEMpop for on gl files.
melissa_popallelefreqs.txt is combined file and contains all the population
FOLDER: popallelecounts_files : contains files for each population containing allele counts at each SNP. This file was made from the pop allele frequency files. See below for more details.
splitPops.pl = perl script to split the gl file into individual populations
vcf2gl.pl = perl script used to convert the vcf file to gl (or genotype likelihood file).
For both these analysis, I used the following steps to prepare the final input file from the vcf file. This analysis includes 26 L. melissa populations. The final input files will be used for each of the analysis. The only difference in these files is the presence of 2 L. anna populations for TreeMix analysis as these populations are included as outgroup for this analysis.
STEP 1
Call : perl vcf2gl melissaVariants2.vcf
Use the vcf2gl. pl to convert the filtered_varMelissaAll.vcf file to LmelissaVariants.gl file. This perl script uses filters to change the file to gl file. There were two filters used : a). remove all fixed SNPs such that these SNPs were fixed for the alleles (19 SNVs in this case) b). include the biallelic snps in this file.
Number of SNVs present in the files:
melissaVariants2.vcf = 206,047
LmelissaVariants_tm.gl = Number of loci: 184110; number of individuals 565
(removed a lot of fixed snvs by addition of L. anna individuals, see above). This file does not add the number of loci in the first row of the gl file. We have to add this number manually otherwise we will get an error in the next step.
STEP 2
Call : perl splitPops.pl LmelissaVariants_tm.gl
The splitPops.pl script is used to split the gl file into individual populations. These files are names as : Lmpop_ABM.gl etc. There are 28 populations (26 L. melissa and 2 L. anna).
These gl files are saved in the following directory: /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/maininputfile/treeMix/gl_files
STEP 3
Estimated population allele frequencies using Zach's estpEM program. This program only gives the population allele frequencies for each population. This program uses maximum likelihood allele frequency estimates by taking the gl files as input and outputting a text file with estimates.
Here is the usage and options for this program:
Usage: popmod -i infile.txt [options]
-i Infile with genetic data for the population
-o Outfile for allele frequency estimates [default = out.txt]
-e Tolerance for EM convergence [default = 0.001]
-m Maximum number of EM iterations [default = 20]
-h Number of header lines [default = 0]
Here is the example command I use (I modified this for each population):
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpEM -i Lmpop_WAL.gl -o p_WAL.txt -h 2
The output file has 3 columns:
Scaffold:position MLE MLE-foranalysis
0:33945 0.0658 0.0000
0:33977 0.0296 0.0000
0:33997 0.1872 0.1667
0:34052 0.0677 0.0000
0:50620 0.1239 0.0002
0:50681 0.1528 0.0709
0:68856 0.1586 0.0005
These files are saved in the following directory : /uufs/chpc.utah.edu/common/home/u6007910/projects/lmelissa_hostAdaptation/maininputfile/bayPass/popfreq_files
An alternate way to do this is using a bayesian model to estimate both genotype and population allele frequencies using popmod. Since this program was taking forever to run we used the estpEM model above. Below I have the details for popmod in case I need it again in the future.
Estimate allele frequencies and genotypes using popmod which is Zach's C++ program.
Here are the notes from Zach about the program and what it is doing:
[I wrote a C++ program to estimate genotypes and population allele frequencies from NGS data using pre-calculated genotype likelihoods. The program is called popmod and is in cvs under this name. This program samples the posterior probability distribution from the following Bayesian model (this is the unnormalized posterior),
Pr(g, p | x, e) = Pr(x | g, e) Pr(g | p) Pr(p)
Here the genotype likelihoods given by Pr(x | g, e) are pre-calculated using samtools and bcftools. The genotype prior is the HW model, and Pr(p) is given by a beta distribution with fixed hyperprior (the default is Jeffery's prior, a = b = 0.5). Thus, all steps in the MCMC algorithm are Gibb's samples from the full conditional posterior distributions. I ran the model on a small example data set and it produced reasonable output.
The output is written to a HDF5 file. I still need to write a version of estpost to parse this file.]
For this I used the runPopmodfork.pl script and ran the popmod program on each population to generate hdf5 files. This script uses popmod program to calculate genotypes and population allele frequencies and has the following usage:
Usage: popmod -i infile.txt [options]
-i Infile with genetic data for the population
-n Number of MCMC steps for the analysis [default = 10000]
-b Discard the first n MCMC samples as a burn-in [default = 1000]
-t Thin MCMC samples by recording every nth value [default = 2]
-o HDF5 format outfile with .hdf5 suffix [default = mcmcout.hdf5]
-a Parameter for beta prior on allele frequencies [default = 0.5]
I used the following options and ran 15000 MCMC steps and burn-in 500 with 4 chains. Here is an example command prompt:
popmod -i Lmpop*.gl -n 15000 -b 500 -t 4 -o pop*.hdf5
STEP 4
Preparing the final input file from all the population allele frequency files above. To run bayPass and treemix I need a file with allele counts at each SNP for each population. To do this I calculated allele frequency counts from the population allele frequencies we calculated in the previous step for each population.
R code for this: This is an example for population ABC which has 19 individuals. The allele counts for a population with n individuals (n =19 for ABC) are calculated as:
allele 1 = allele frequency * 2n
allele 2 = 2n - allele 1
Here allele frequency is the population allele frequency calculated in the previous step for each population.
#Repeat for each population by changing n (number of individuals in the population)
abm = read.table("p_ABM.txt", header=FALSE)
#create a function to get allele counts and round off the counts
allcount<- function(x){
counta <- x * 92
countaa <- round(counta,0)
}
#apply this function to the allele freq column of each population vector in R
p.abm = sapply(abm[,3],allcount)
#get the second allele count and combine both
abm.counts = cbind(p.abm, 92-p.abm)
write.table(p.abm, file = "popcounts_ABM.txt", sep =",", row.names=FALSE, col.names=FALSE, quote=FALSE)
ABC 19
ABM 46
BHP 20
BSD 20
CDY 23
CKV 10
DBQ 20
DCR 20
GLA 20
GVL 18
LAN 24
LCA 20
MON 20
MTU 19
OCY 19
REW 20
SCC 16
SLA 18
SUV 20
SVY 20
TPT 13
UAL 20
VCP 20
VIC 20
WAL 20
YWP 20
source the Lmelissa bam files
/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_gbs/AssembliesHost
source the L anna bam files
/uufs/chpc.utah.edu/common/home/gompert-group1/data/lycaeides/lycaeides_gbs/AssembliesAnna
source bcf tools
~/../u6000989/bin/bcftools call