males_analysis

Folder for infiles and outfiles: /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/entropy/complete/males

I am not rerunning the analysis on just males to check how important the sex chromosome is in all of this. For this, I created a modified file which has males and females added as (M or F) in all the individual IDs (subset_filtered_variantsLycaeides.gl). I then used the splitPops.pl script to split the file into a male and female file (Lmpop_M.gl and Lmpop_F.gl). I then manually deleted the line with 1111 to make sure there are no errors downstream. I then used the male file (Lmpop_M.gl) to run the analysis again.

The infile and scripts used to create this file is all saved in: /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/entropy/common/males/infile

I am running entropy in this folder and then will move the rest of the files in different folders for further analysis.

Step 1: Entropy

Here I copied the perl scripts and R scripts in the folder similar to steps and folders described on the entropy page here: https://sites.google.com/site/gompertlabnotes/home/researcher-pages/samridhi-chaturvedi/hybridsdiapause/entropy

Note the infile here is Lmpop_M.gl and therefore scripts have been modified for directories and filenames.

Step 2: getting g and q files

After running entropy we end up with hdf5 files for each chain and K combination. Then I generated the genotype likelihood file using the following commands (note how output and input filenames change for each K:

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o g_K2.txt -p gprob -s 0 -w 0 ento_lyc_hybrids_commonCh*K2.hdf5

#parameter dimensions for gprob: loci = 39193, ind = 479, genotypes = 3, chains = 3

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o g_K3.txt -p gprob -s 0 -w 0 ento_lyc_hybrids_commonCh*K3.hdf5

#parameter dimensions for gprob: loci = 39193, ind = 479, genotypes = 3, chains = 3

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o g_K4.txt -p gprob -s 0 -w 0 ento_lyc_hybrids_commonCh*K4.hdf5

#parameter dimensions for gprob: loci = 39193, ind = 479, genotypes = 3, chains = 3

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o g_K5.txt -p gprob -s 0 -w 0 ento_lyc_hybrids_commonCh*K5.hdf5

#parameter dimensions for gprob: loci = 39193, ind = 479, genotypes = 3, chains = 3

### generating q files (Got the point estimates and quantiles using the following command):

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o q_K2.txt -p q -s 0 -w 0 ento_lyc_hybrids_commonCh*K2.hdf5

#parameter dimensions for q: ind = 479, populations = 2, samples = 1000, chains = 3

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o q_K3.txt -p q -s 0 -w 0 ento_lyc_hybrids_commonCh*K3.hdf5

#parameter dimensions for q: ind = 479, populations = 3, samples = 1000, chains = 3

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o q_K4.txt -p q -s 0 -w 0 ento_lyc_hybrids_commonCh*K4.hdf5

#parameter dimensions for q: ind = 479, populations = 4, samples = 1000, chains = 3

/uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_entropy -o q_K5.txt -p q -s 0 -w 0 ento_lyc_hybrids_commonCh*K5.hdf5

#parameter dimensions for q: ind = 479, populations = 5, samples = 1000, chains = 3

First i needed to create a file which has the population order. I created this file in the/uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/entropy/complete/males/infile folder as follows

1. In bash: cat Lmpop_M.gl | head -2 | tail -1 > temp_pops.txt

2. Go to R:

x <- read.table("temp_pops.txt", sep = " ", header=F)

newx <- t(x)

write.table(newx, file="temp_popt.txt", quote=F, row.names=F,col.names=F)

3). In bash: cat temp_popt.txt | grep -E [A-Z] | cut -c 1-3 > poporder.txt

4. Copy the poporder file to the mcmc folder for further plotting.

Step 3: PCA and admixture proportions plot

I performed PCA and tested for admixture proportions. For this I used the code in plots.R. This script is saved in the mcmc folder. I then created the PCA and barplots and saved it in the same folder as pcabarplots.pdf. The PCA show the same population structure and genetic variation based on PC1 scores.

I then used the genotype estimate averages across each K (2 to 5) to create a file which I used for popanc analysis. This file is called males_genoest_entropy.txt. This is the file I used to prepare infiles for popanc.

Basically, I just wrote out the average genotype point estimates matrix by combining a column of populations.

Step 4: Popanc

A. Preparing the input files

To do this I first copied the males_genoest_entropy.txt file from the entropy/males folder to the popanc/males/infiles folder. I then went into R and subset the file for each population. I did this by using the following for loop:

mat<-read.table("males_genoest.gl", header=FALSE, sep=" ")

pops = c("BCR","BIC","BLD","BNP","BST","BTB","CDY","CKV","DBS","FRC","GNP","KHL","LAN","MON","PIN","PSP","RDL","RNV","SDC","SIN","SYC","VIC","YWP")

mat_t<-t(mat)

#for loop

for (i in pops){

newmat = mat_t[which(mat_t[,1]==i),]

newmat_t = t(newmat)

newmat_m =

write.table(newmat, file=i)

}

This created a file with each population name. I then changed the name of each of these files to lyc_genoest_*.txt.

## creating and editing aims files

sed -i "1s/^/$(head -n1 ../lyc_genoest_BCR.txt)\n/" *lyc_genoest_BCR.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_GNP.txt)\n/" *lyc_genoest_GNP.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_PSP.txt)\n/" *lyc_genoest_PSP.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_BIC.txt)\n/" *lyc_genoest_BIC.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_RDL.txt)\n/" *lyc_genoest_RDL.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_BTB.txt)\n/" *lyc_genoest_BTB.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_RNV.txt)\n/" *lyc_genoest_RNV.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_BNP.txt)\n/" *lyc_genoest_BNP.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_DBS.txt)\n/" *lyc_genoest_DBS.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_FRC.txt)\n/" *lyc_genoest_FRC.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_PIN.txt)\n/" *lyc_genoest_PIN.txt

sed -i "1s/^/$(head -n1 ../lyc_genoest_BLD.txt)\n/" *lyc_genoest_BLD.txt

perl -p -i -e 's/39193/4610/g' aims1_lyc_genoest*

perl -p -i -e 's/39193/2133/g' aims2_lyc_genoest*

perl -p -i -e 's/39193/1233/g' aims3_lyc_genoest*

The infiles are saved in the folder: /uufs/chpc.utah.edu/common/home/u6007910/projects/lyc_dubois/popanc/males/infiles

The aims files are in the folder called "aims" in the same infiles folder. I ran popanc for 3 aims cutoffs (top 10% = 4610, top 20% 2123 and top 30% 1223). These are the cutoffs I used for all downstream analysis and even for clines. The aims folder has the combined idas and melissa files. These are the same populations I used for all other analysis with all individuals. I then ran popanc using these files.

B. Output: hdf5 files (saved in the males/outfiles/ folder):

Convert the hdf5 files to text to conduct analysis:

for f in *.hdf5; do /uufs/chpc.utah.edu/common/home/u6000989/bin/estpost_popanc -o "$f.txt" -p popQ -s 0 $f; done

#change file extensions:

for file in *.hdf5.txt; do mv "$file" "${file/.hdf5.txt/.txt}"; done

For the R analysis and plotting, I needed the scaffold and position of SNPs in each AIMS group. I just used grep to do this:

#repeated this for each AIMS

grep -e -o "^[0-9]+\:[0-9]+" comb_idas_aims1.txt > mappos_aims1.txt

Results:

Here are the results for randomizations:

BLD: obs =64, xfold = 2.64904, p = <0

PIN obs = 66, xfold = 2.737363, p <=0

Step 5: Clines

Results: