Folder for infiles and outfiles: /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/entropy/complete/males
I am not rerunning the analysis on just males to check how important the sex chromosome is in all of this. For this, I created a modified file which has males and females added as (M or F) in all the individual IDs (subset_filtered_variantsLycaeides.gl). I then used the splitPops.pl script to split the file into a male and female file (Lmpop_M.gl and Lmpop_F.gl). I then manually deleted the line with 1111 to make sure there are no errors downstream. I then used the male file (Lmpop_M.gl) to run the analysis again.
The infile and scripts used to create this file is all saved in: /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/entropy/common/males/infile
I am running entropy in this folder and then will move the rest of the files in different folders for further analysis.
Step 1: Entropy
Here I copied the perl scripts and R scripts in the folder similar to steps and folders described on the entropy page here: https://sites.google.com/site/gompertlabnotes/home/researcher-pages/samridhi-chaturvedi/hybridsdiapause/entropy
Note the infile here is Lmpop_M.gl and therefore scripts have been modified for directories and filenames.
Step 2: getting g and q files
After running entropy we end up with hdf5 files for each chain and K combination. Then I generated the genotype likelihood file using the following commands (note how output and input filenames change for each K:
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o g_K2.txt -p gprob -s 0 -w 0 ento_lyc_hybrids_commonCh*K2.hdf5
#parameter dimensions for gprob: loci = 39193, ind = 479, genotypes = 3, chains = 3
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o g_K3.txt -p gprob -s 0 -w 0 ento_lyc_hybrids_commonCh*K3.hdf5
#parameter dimensions for gprob: loci = 39193, ind = 479, genotypes = 3, chains = 3
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o g_K4.txt -p gprob -s 0 -w 0 ento_lyc_hybrids_commonCh*K4.hdf5
#parameter dimensions for gprob: loci = 39193, ind = 479, genotypes = 3, chains = 3
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o g_K5.txt -p gprob -s 0 -w 0 ento_lyc_hybrids_commonCh*K5.hdf5
#parameter dimensions for gprob: loci = 39193, ind = 479, genotypes = 3, chains = 3
### generating q files (Got the point estimates and quantiles using the following command):
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o q_K2.txt -p q -s 0 -w 0 ento_lyc_hybrids_commonCh*K2.hdf5
#parameter dimensions for q: ind = 479, populations = 2, samples = 1000, chains = 3
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o q_K3.txt -p q -s 0 -w 0 ento_lyc_hybrids_commonCh*K3.hdf5
#parameter dimensions for q: ind = 479, populations = 3, samples = 1000, chains = 3
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o q_K4.txt -p q -s 0 -w 0 ento_lyc_hybrids_commonCh*K4.hdf5
#parameter dimensions for q: ind = 479, populations = 4, samples = 1000, chains = 3
/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o q_K5.txt -p q -s 0 -w 0 ento_lyc_hybrids_commonCh*K5.hdf5
#parameter dimensions for q: ind = 479, populations = 5, samples = 1000, chains = 3
First i needed to create a file which has the population order. I created this file in the/uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/entropy/complete/males/infile folder as follows
1. In bash: cat Lmpop_M.gl | head -2 | tail -1 > temp_pops.txt
2. Go to R:
x <- read.table("temp_pops.txt", sep = " ", header=F)
newx <- t(x)
write.table(newx, file="temp_popt.txt", quote=F, row.names=F,col.names=F)
3). In bash: cat temp_popt.txt | grep -E [A-Z] | cut -c 1-3 > poporder.txt
4. Copy the poporder file to the mcmc folder for further plotting.
Step 3: PCA and admixture proportions plot
I performed PCA and tested for admixture proportions. For this I used the code in plots.R. This script is saved in the mcmc folder. I then created the PCA and barplots and saved it in the same folder as pcabarplots.pdf. The PCA show the same population structure and genetic variation based on PC1 scores.
I then used the genotype estimate averages across each K (2 to 5) to create a file which I used for popanc analysis. This file is called males_genoest_entropy.txt. This is the file I used to prepare infiles for popanc.
Basically, I just wrote out the average genotype point estimates matrix by combining a column of populations.
Step 4: Popanc
A. Preparing the input files
To do this I first copied the males_genoest_entropy.txt file from the entropy/males folder to the popanc/males/infiles folder. I then went into R and subset the file for each population. I did this by using the following for loop:
mat<-read.table("males_genoest.gl", header=FALSE, sep=" ")
pops = c("BCR","BIC","BLD","BNP","BST","BTB","CDY","CKV","DBS","FRC","GNP","KHL","LAN","MON","PIN","PSP","RDL","RNV","SDC","SIN","SYC","VIC","YWP")
mat_t<-t(mat)
#for loop
for (i in pops){
newmat = mat_t[which(mat_t[,1]==i),]
newmat_t = t(newmat)
newmat_m =
write.table(newmat, file=i)
}
This created a file with each population name. I then changed the name of each of these files to lyc_genoest_*.txt.
## creating and editing aims files
sed -i "1s/^/$(head -n1 ../lyc_genoest_BCR.txt)\n/" *lyc_genoest_BCR.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_GNP.txt)\n/" *lyc_genoest_GNP.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_PSP.txt)\n/" *lyc_genoest_PSP.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_BIC.txt)\n/" *lyc_genoest_BIC.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_RDL.txt)\n/" *lyc_genoest_RDL.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_BTB.txt)\n/" *lyc_genoest_BTB.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_RNV.txt)\n/" *lyc_genoest_RNV.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_BNP.txt)\n/" *lyc_genoest_BNP.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_DBS.txt)\n/" *lyc_genoest_DBS.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_FRC.txt)\n/" *lyc_genoest_FRC.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_PIN.txt)\n/" *lyc_genoest_PIN.txt
sed -i "1s/^/$(head -n1 ../lyc_genoest_BLD.txt)\n/" *lyc_genoest_BLD.txt
perl -p -i -e 's/39193/4610/g' aims1_lyc_genoest*
perl -p -i -e 's/39193/2133/g' aims2_lyc_genoest*
perl -p -i -e 's/39193/1233/g' aims3_lyc_genoest*
The infiles are saved in the folder: /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/popanc/males/infiles
The aims files are in the folder called "aims" in the same infiles folder. I ran popanc for 3 aims cutoffs (top 10% = 4610, top 20% 2123 and top 30% 1223). These are the cutoffs I used for all downstream analysis and even for clines. The aims folder has the combined idas and melissa files. These are the same populations I used for all other analysis with all individuals. I then ran popanc using these files.
B. Output: hdf5 files (saved in the males/outfiles/ folder):
Convert the hdf5 files to text to conduct analysis:
for f in *.hdf5; do /uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_popanc -o "$f.txt" -p popQ -s 0 $f; done
#change file extensions:
for file in *.hdf5.txt; do mv "$file" "${file/.hdf5.txt/.txt}"; done
For the R analysis and plotting, I needed the scaffold and position of SNPs in each AIMS group. I just used grep to do this:
#repeated this for each AIMS
grep -e -o "^[0-9]+\:[0-9]+" comb_idas_aims1.txt > mappos_aims1.txt
Results:
Here are the results for randomizations:
BLD: obs =64, xfold = 2.64904, p = <0
PIN obs = 66, xfold = 2.737363, p <=0
Step 5: Clines
Results: