I am doing the genomic clines analysis for Dubois (DBS). The analysis uses the bgc program; details and downloads are here: https://sites.google.com/site/bgcsoftware/. Here are the steps used for this analysis:
Folder where all files are saved: /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/clines
Step 1: Preparing input files
There are three input files: admixedin.txt (allele count file for the admixed population), p0in.txt (allele count file for parental species 0) and p1in.txt (allele count file for parental species 1). This was done for the AIM SNPs that I selected for the popanc analysis. I identified ancestry informative markers (AIMs), defined here as SNPs with a parental allele frequency difference of > 0.1, 0.2 or 0.3. I used the combined L. idas populations (SDY, SDC and KHL) and the combined L. melissa populations (BST, SIN, CDY and CKV) as the parents to delimit the AIMs.
Since the input format is gl files, I had to create gl files for each AIMs cutoff for the idas, melissa and admixed (DBS) populations. I did this using the following steps:
I copied the variants file from the Variants folder; this has all the populations and all the filtered variants (npop = 835, nloci = 39193). I wanted to cut the original SNP set down to ancestry informative markers, in this case using minimum allele frequency differences of 0.1, 0.2 and 0.3. I used the popanc results to identify the list of SNPs to keep, and generated this list in R:
## identify a subset of loci to keep based on differences in idas and melissa parental allele frequencies, using the combined files
#read in files
idas_pure<-read.table("comb_idas_pure.txt", skip=1, header=F)
idas_pure_m<-as.matrix(idas_pure[,-1])
idas_pure_p<-apply(idas_pure_m,1,mean)/2
mel_sub<-read.table("comb_melissa_sub.txt", skip=1, header=F)
mel_sub_m<-as.matrix(mel_sub[,-1])
mel_sub_p<-apply(mel_sub_m,1,mean)/2
### getting aims loci
dp<-abs(idas_pure_p-mel_sub_p)
sum(dp>0.2)
hidif1<-dp>0.1
hidif2<-dp>0.2
hidif3<-dp>0.3
## check: number of AIMs at each cutoff (length() only returns nloci, so use sum())
sum(hidif1)
sum(hidif2)
sum(hidif3)
head(hidif1)
#write out the files
write.table(as.numeric(hidif1), "../../clines/difSnps10pct.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)
write.table(as.numeric(hidif2), "../../clines/difSnps20pct.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)
write.table(as.numeric(hidif3), "../../clines/difSnps30pct.txt", row.names=FALSE, col.names=FALSE, quote=FALSE)
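The mask files can also be sanity-checked from the shell. This is a hypothetical toy example of the kind of check (toy_mask.txt is a made-up stand-in for a difSnps*.txt file): the line count should equal the full SNP count (39193 here) and the sum of the 0/1 entries the number of AIMs retained.

```shell
# Toy sanity check for a 0/1 mask file like difSnps20pct.txt:
# line count = total SNPs, sum of entries = AIMs retained.
printf '1\n0\n1\n1\n' > toy_mask.txt
wc -l < toy_mask.txt                          # total SNPs (toy: 4)
awk '{s += $1} END {print s}' toy_mask.txt    # AIMs kept (toy: 3)
```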
I then ran subSetSnps.pl for each AIMs cutoff to generate the sub* genotype files, e.g.,
perl subSetSnps.pl difSnps10pct.txt sub_variantsLycaeides1.gl #4434 snps
perl subSetSnps.pl difSnps20pct.txt sub_variantsLycaeides2.gl #2126 snps
perl subSetSnps.pl difSnps30pct.txt sub_variantsLycaeides3.gl #1164 snps
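I don't reproduce subSetSnps.pl here, but the filtering it performs can be sketched in the shell. This is a hypothetical re-implementation on toy data, assuming one SNP per line in both files and ignoring any header handling the real script does:

```shell
# Keep the i-th gl line when the i-th mask entry is 1 (toy files).
printf '1\n0\n1\n' > toy_mask.txt
printf 's1 g g\ns2 g g\ns3 g g\n' > toy.gl
paste -d' ' toy_mask.txt toy.gl | awk '$1 == 1' | cut -d' ' -f2- > sub_toy.gl
cat sub_toy.gl    # -> "s1 g g" and "s3 g g"
```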
Now I used these sub_variantsLycaeides*.gl files to generate the parental and admixed (DBS) gl files for bgc. I used the following steps to generate the final input files.
Use the script splitpops.pl to split the files for each cutoff (1, 2, 3) into per-population gl files, then use the split pop gl files to create the final input files.
Copy the per-population files into temp files and remove the header line in vim.
Then remove the mappos in all the pops except one, so that it is retained once in the final file. Use the following to create out files with no mappos and no header (note the -E flag: without it sed treats + literally and the mappos is only dropped by the cut):
sed -E 's/^[0-9]+:[0-9]+//' temp_frc.gl | cut -d " " -f2- > outfrc.gl
Combine the files using paste:
paste -d' ' temp_bld.gl outfrc.gl > idas_p01_pct1.gl
Using vim, add nind and nloci to the final file.
sed -E 's/^[0-9]+:[0-9]+//' temp_sin.gl | cut -d " " -f2- > outsin.gl
sed -E 's/^[0-9]+:[0-9]+//' temp_ckv.gl | cut -d " " -f2- > outckv.gl
paste -d' ' temp_lan.gl outsin.gl outckv.gl > melissa_p02_pct1.gl
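The strip-and-paste steps above can be demonstrated on toy two-population files (hypothetical contents; the real temp_* files hold one line of genotype likelihoods per SNP). Here cut alone drops the leading field, which is what the sed | cut pipeline achieves:

```shell
# Keep the scaffold:position (mappos) field only from the first
# population, then paste the populations side by side.
printf '12:345 0 1 2\n13:678 2 1 0\n' > toy_bld.gl
printf '12:345 1 1 1\n13:678 0 0 2\n' > toy_frc.gl
cut -d' ' -f2- toy_frc.gl > toy_out.gl        # drop the mappos field
paste -d' ' toy_bld.gl toy_out.gl             # -> "12:345 0 1 2 1 1 1" ...
```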
I have moved all the files created in this process to the folder glfiles, which has subfolders for each AIMs cutoff: pct01, pct02, pct03.
Step 2: Running bgc
Now I have all the input files for running bgc. Note the file naming: species_parent_aimsCutoff. They are listed here:
Parent 01 IDAS: idas_p01_pct1.gl, idas_p01_pct2.gl, idas_p01_pct3.gl
Parent 02 MELISSA: melissa_p02_pct1.gl, melissa_p02_pct2.gl, melissa_p02_pct3.gl
Admixed, DBS: admixed_dbs_pct1.gl, admixed_dbs_pct2.gl, admixed_dbs_pct3.gl
I then used bgc to fit the genomic clines via the perl fork script forkgenomicclines.pl, which I submitted to the cluster with the bash script rungenomicclines_fork.sh. I ran 5 chains; here is one example command (note that I am running bgc from ZG's bin directory):
system "/uufs/chpc.utah.edu/common/home/u6000989/bin/bgc -a $dir"."idas_p01_pct1.gl -b $dir"."melissa_p02_pct1.gl -h $dir"."admixed_dbs_pct1.gl -F $odir"."bgcout_pct1_$j -O 0 -x 25000 -n 5000 -t 5 -p 1 -q 1 -N 1 -m 0 -d 1 -s 1 -I 1 -T 0";
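The fork script essentially runs that command once per chain index. A hypothetical dry-run bash equivalent (echoing rather than executing, with placeholder dir/odir values) looks like:

```shell
# Dry run of the 5 bgc chains: print the command for each chain index.
# dir and odir are hypothetical placeholders for the input/output folders.
bgc=/uufs/chpc.utah.edu/common/home/u6000989/bin/bgc
dir=glfiles/pct01/
odir=outfiles/
for j in 0 1 2 3 4; do
  echo "$bgc" -a "${dir}idas_p01_pct1.gl" -b "${dir}melissa_p02_pct1.gl" \
       -h "${dir}admixed_dbs_pct1.gl" -F "${odir}bgcout_pct1_${j}" \
       -O 0 -x 25000 -n 5000 -t 5 -p 1 -q 1 -N 1 -m 0 -d 1 -s 1 -I 1 -T 0
done
```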
The output files from this script/analysis are saved in the outfiles folder as hdf5 files.
Step 3: Estimating posteriors using estpost_bgc (this was all done in the outfiles folder)
Once I had the hdf5 files, I used the estpost_bgc program to convert them into files containing means, medians and CIs for the genomic cline parameters. I generated two files per chain, alpha and beta; the loop below is for beta, and the alpha files were generated the same way with -p alpha. I did this using a for loop in bash:
for i in *.hdf5; do
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_bgc -i $i -s 0 -w -o -p beta
mv postout.txt ${i%.hdf5.txt}_beta.txt;
done
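A note on the ${i%.hdf5.txt} expansion: % strips a trailing match, and since the chain outputs end in .hdf5 (not .hdf5.txt) the strip is a no-op here, which is why the downstream file names still contain .hdf5. A small demonstration:

```shell
# ${var%pattern} removes a matching suffix; a non-matching pattern
# leaves the value unchanged.
i=bgcout_pct1_0.hdf5
echo "${i%.hdf5.txt}_beta.txt"   # -> bgcout_pct1_0.hdf5_beta.txt (no-op strip)
echo "${i%.hdf5}_beta.txt"       # -> bgcout_pct1_0_beta.txt (suffix stripped)
```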
I then used R to plot the alpha and beta estimates and to check correlations between them. I need to do more plotting for this....
Step 4: Getting medians across the MCMC samples of chains 0-4 for alpha and beta
To do predictability tests, I used the MCMC output for each chain. To do this, I first used the following bash loops to dump the MCMC point data for each chain:
for i in *.hdf5; do
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_bgc -i $i -s 2 -w -o -p alpha
mv postout.txt ${i%.hdf5.txt}_mcmc_alpha.txt;
done
for i in *.hdf5; do
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_bgc -i $i -s 2 -w -o -p beta
mv postout.txt ${i%.hdf5.txt}_mcmc_beta.txt;
done
Then in R:
## read in the mcmc files for alpha ####
pct3_0<-read.table("bgcout_pct3_0.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct3_1<-read.table("bgcout_pct3_1.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct3_2<-read.table("bgcout_pct3_2.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct3_3<-read.table("bgcout_pct3_3.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct3_4<-read.table("bgcout_pct3_4.hdf5_mcmc_alpha.txt", header=F, sep = ",")
## check the dimensions of the matrices ###
dim(pct3_0)
dim(pct3_1)
dim(pct3_2)
dim(pct3_3)
dim(pct3_4)
## concatenate the matrices to create one giant matrix ###
alpha_mat<-cbind(pct3_0,pct3_1,pct3_2,pct3_3,pct3_4)
## check
dim(alpha_mat)
## take median across columns
alpha_median<-apply(alpha_mat,1,median)
length(alpha_median) #1164
## read in the mcmc files for beta ####
pct3_0b<-read.table("bgcout_pct3_0.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct3_1b<-read.table("bgcout_pct3_1.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct3_2b<-read.table("bgcout_pct3_2.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct3_3b<-read.table("bgcout_pct3_3.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct3_4b<-read.table("bgcout_pct3_4.hdf5_mcmc_beta.txt", header=F, sep = ",")
## check the dimensions of the matrices ###
dim(pct3_0b)
dim(pct3_1b)
dim(pct3_2b)
dim(pct3_3b)
dim(pct3_4b)
## concatenate the matrices to create one giant matrix ###
beta_mat<-cbind(pct3_0b,pct3_1b,pct3_2b,pct3_3b,pct3_4b)
## check
dim(beta_mat)
## take median across columns
beta_median<-apply(beta_mat,1,median)
length(beta_median) #1164
## write out the file
## combine alpha and beta medians
alpha_beta<-cbind(alpha_median, beta_median)
colnames(alpha_beta)<-c("alpha_median","beta_median")
head(alpha_beta)
write.table(alpha_beta, file="pct3_alphabeta_mcmcmedian.txt", col.names=FALSE, row.names=FALSE, quote=FALSE )
#### pct 1#####
## read in the mcmc files for alpha ####
pct1_0<-read.table("bgcout_pct1_0.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct1_1<-read.table("bgcout_pct1_1.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct1_2<-read.table("bgcout_pct1_2.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct1_3<-read.table("bgcout_pct1_3.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct1_4<-read.table("bgcout_pct1_4.hdf5_mcmc_alpha.txt", header=F, sep = ",")
## check the dimensions of the matrices ###
dim(pct1_0)
dim(pct1_1)
dim(pct1_2)
dim(pct1_3)
dim(pct1_4)
## concatenate the matrices to create one giant matrix ###
alpha_mat<-cbind(pct1_0,pct1_1,pct1_2,pct1_3,pct1_4)
## check
dim(alpha_mat)
## take median across columns
alpha_median<-apply(alpha_mat,1,median)
length(alpha_median) #1164
## read in the mcmc files for beta ####
pct1_0b<-read.table("bgcout_pct1_0.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct1_1b<-read.table("bgcout_pct1_1.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct1_2b<-read.table("bgcout_pct1_2.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct1_3b<-read.table("bgcout_pct1_3.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct1_4b<-read.table("bgcout_pct1_4.hdf5_mcmc_beta.txt", header=F, sep = ",")
## check the dimensions of the matrices ###
dim(pct1_0b)
dim(pct1_1b)
dim(pct1_2b)
dim(pct1_3b)
dim(pct1_4b)
## concatenate the matrices to create one giant matrix ###
beta_mat<-cbind(pct1_0b,pct1_1b,pct1_2b,pct1_3b,pct1_4b)
## check
dim(beta_mat)
## take median across columns
beta_median<-apply(beta_mat,1,median)
length(beta_median) #1164
## write out the file
## combine alpha and beta medians
alpha_beta<-cbind(alpha_median, beta_median)
colnames(alpha_beta)<-c("alpha_median","beta_median")
head(alpha_beta)
write.table(alpha_beta, file="pct1_alphabeta_mcmcmedian.txt", col.names=FALSE, row.names=FALSE, quote=FALSE )
############## pct 2 #############################################
## read in the mcmc files for alpha ####
pct2_0<-read.table("bgcout_pct2_0.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct2_1<-read.table("bgcout_pct2_1.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct2_2<-read.table("bgcout_pct2_2.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct2_3<-read.table("bgcout_pct2_3.hdf5_mcmc_alpha.txt", header=F, sep = ",")
pct2_4<-read.table("bgcout_pct2_4.hdf5_mcmc_alpha.txt", header=F, sep = ",")
## check the dimensions of the matrices ###
dim(pct2_0)
dim(pct2_1)
dim(pct2_2)
dim(pct2_3)
dim(pct2_4)
## concatenate the matrices to create one giant matrix ###
alpha_mat<-cbind(pct2_0,pct2_1,pct2_2,pct2_3,pct2_4)
## check
dim(alpha_mat)
## take median across columns
alpha_median<-apply(alpha_mat,1,median)
length(alpha_median) #1164
## read in the mcmc files for beta ####
pct2_0b<-read.table("bgcout_pct2_0.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct2_1b<-read.table("bgcout_pct2_1.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct2_2b<-read.table("bgcout_pct2_2.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct2_3b<-read.table("bgcout_pct2_3.hdf5_mcmc_beta.txt", header=F, sep = ",")
pct2_4b<-read.table("bgcout_pct2_4.hdf5_mcmc_beta.txt", header=F, sep = ",")
## check the dimensions of the matrices ###
dim(pct2_0b)
dim(pct2_1b)
dim(pct2_2b)
dim(pct2_3b)
dim(pct2_4b)
## concatenate the matrices to create one giant matrix ###
beta_mat<-cbind(pct2_0b,pct2_1b,pct2_2b,pct2_3b,pct2_4b)
## check
dim(beta_mat)
## take median across columns
beta_median<-apply(beta_mat,1,median)
length(beta_median) #1164
## write out the file
## combine alpha and beta medians
alpha_beta<-cbind(alpha_median, beta_median)
colnames(alpha_beta)<-c("alpha_median","beta_median")
head(alpha_beta)
write.table(alpha_beta, file="pct2_alphabeta_mcmcmedian.txt", col.names=FALSE, row.names=FALSE, quote=FALSE )