Folder details and important files in the folder:
All analysis for Treemix was done in Lmelissa_host_adaptation folder and further analysis files are in Lmelissa_Treemix folder.
Further folders are as follows and have the respective files:
fileReplicates - contains all the replicate.cov,gz files from treemix analysis
files_genotype - contains all g_ABC.txt etc files
file_phenotype - contains all p_ABC.txt etc files
files_tee - contains all melissa_tree files. This is a copy of files.
treemix_files - this folder contains all the melissa_tree files and output files + scripts
scripts - contains all the shell scripts
Sumtree_finaltrees - this folder is saved on my desktop and has all the final tree files from sumtree
RAN SAMtools ON THE L.MELISSA DATA.
STEP 1
The total number of populations is 565 which is the number of lines in the locuslist.txt. The list of bam files including L.melissa populations and L.ana populations which we are using as a outgroup to construct the tree in TreeMix. The list of bam files is saved in bamList.txt. The list of loci for these populations is saved in locuslist.txt.
Ran the samtools mpileup on the final alignment file on the cluster with following steps:
a).Created the following script in nano:
command used: vim nano
script saved as runmelissaVariants.sh
#!/bin/sh
#PBS -N samtools
#PBS -l nodes=1:ppn=1
#PBS -l walltime=96:00:00
#PBS -l mem=24g
#PBS -q batch
. /rc/tools/utils/dkinit
reuse SAMtools
cd /labs/evolution/projects/Lmelissa_host_adaptation/variants/bams
samtools mpileup -C 50 -d 50 -l locuslist.txt -f ../../final.assembly.fasta -q 10 -Q 15 -D -g -I aln*.bam > melissaVariants.bcf
b). submit the job to the DoRC with the following command:
qsub runmelissaVariants.sh
c). To check the status of the job, use the following command:
qstat runmelissaVariants.sh
or
qstat -u A02079361
The job number submitted to the cluster is 102598. The output will be saved as slurm102598.out
STEP 2
Submitted two scripts to DoRC with the following commands in SAMtools saved in shell scripts called runmelissaVariants1. sh (with -v) and runmelissaVariants2.sh (without -v)
#!/bin/sh
#PBS -N samtools
#PBS -l nodes=1:ppn=1
#PBS -l walltime=96:00:00
#PBS -l mem=24g
#PBS -q batch
. /rc/tools/utils/dkinit
reuse SAMtools
cd /labs/evolution/projects/Lmelissa_host_adaptation/variants/bams
bcftools view -c -d 0.8 -g -p 0.01 -P full -t 0.001 -v -l ../locuslist.txt melissaVariants.bcf > melissaVariants1.vcf
-v option: output variants file only
the analysis gave weird results with the -d option. So we REMOVED the -d option and got the results. The melissaVariants2.vcf is the working file and will be used for further analysis.
My final input script was the following:
#!/bin/sh
#PBS -N samtools
#PBS -l nodes=1:ppn=1
#PBS -l walltime=96:00:00
#PBS -l mem=24g
#PBS -q batch
. /rc/tools/utils/dkinit
reuse SAMtools
cd /labs/evolution/projects/Lmelissa_host_adaptation/variants/bams
bcftools view -c -g -p 0.01 -P full -t 0.001 -v -l ../locuslist.txt melissaVariants.bcf > melissaVariants2.vcf
The script is saved in the directory cd /labs/evolution/projects/Lmelissa_hostplantadaptation
The melissaVariants2.vcf is in the bams folder in variants folder in Lmelissa_hostplant
To check the number word counts for the scaffolds in the given locuslist I used the following command:
grep -c *scaf path/melissaVariants2.vcf
STEP 2
Now I converted the vcf file Lmelissa_variants.vcf to a gl file by using the perl script vcf2gl.pl and got the outfile LmelissaVariants.gl
perl ../scripts/vcf2gl.pl Lmelissa_variants.pl
Next I split the populations in the previous file using the perl script splitPops.pl and received population files like LmpopYBG..etc.
perl ../scripts/splitPops.pl LmelissaVariants.plhttps://sites.google.com/site/gompertlabnotes/home/samridhi-chaturvedi/l-melissa-genomic-analysis
STEP 3
ZG did the following:
Go to the directory labs/evolution/scratch/melissa. Has all the hdf5 files for the two mcmc runs as 0 and 1.
./estpost_popmod
#assign interactive job on DoRC
salloc --nodes=1 --ntasks-per-node=12 -p interactive
#running the estpost_popmod
./estpost_popmod -o p_ABC.txt -p p -s 0 -w 0 Lmpop_ABCCh*hdf5
./estpost_popmod -o g_ABC.txt -p g -s 0 -w 0 Lmpop_ABCCh*hdf5
SUBMITTING THE PERL SCRIPT calCG.pl to do the above task for all populations at once
chmod a+x calcG.pl
ls -lh calcG.pl
./calcG.pl Lmpop_*0.hdf5
#check these files in HDFView.
HDFview will have 2 files: p and q
1). g(or genotype-shows posterior probability for that genotype in the population)
This will generate 3 files:0,1,2
0 is for reference homozygous (AA)
1 is for heterozygous (Ab)
2 is for alternate homozygous (BB)
Each row is the locus. In this file the loci are 184287. The columns are the post. prob. for that genotype for each individuals. In this population there are 19 individuals.
-If the post. prob value is 1, then the genotype is taken as true.
-If the value is in decimals, then the estpost_popmod script/programme will calculate the weighted average of the values in the three files 0,1,2 for that individual at that locus.
2). p)or population allele frequencies)
Row is the locus.
Column is the allele frequency generated at each step of the mcmc chain which is 2000 in this file as we gave 2 mcmc chains to run.
Go to R
G<-matrix(scan("g_ABC.txt",n=184287*19,sep=","),nrow=19,ncol=184287,byrow=T)
#Note that 19 is the population size for this population and will vary for all populations.
dim(G)
P<-apply(G,2,sum)
P<-round(apply(G,2,sum),0)
#38 will change according to population size. It will be population size*2 as the population is diploid so the genotype will double.
PP<-cbind(P,38-P)
write.table(PP,"pp_ABC.txt",sep=",",row.names=F,col.names=F)
After creating all the p and g files for all the populations, I created the input file for TreeMix analysis.
Goto R
P1<-read.table("p_ABC.txt")
P2<-read.table("p_BSY.txt")
Lmelissa_inputfile<-merge(P1,P2)
colnames(Lmelissa_inputfile)<-c("ABC","BSY")
write.table(Lmelissa_inputfile,"Lmelissa_inputfile",col.names=T,row.names=F,quote=F)
The input file for Treemix is saved as Lmelissa_inputfile in the directory projects/Lmelissa_host_adaptation/Lmelissa_treemix
STEP 4
TREEMIX - All the files are saved in /labs/evolution/projects/Lmelissa_host_adaptation/Lmelissa_Treemix/treemix_files
Created the following script for submitting the job to run treemix with 0-10 migrations events.
#!/bin/sh
#PBS -N treemix
#PBS -l nodes=1:ppn=1
#PBS -l walltime=96:00:00
#PBS -l mem=24g
#PBS -q batch
. /rc/tools/utils/dkinit
use Boost
use Datatools-Serial-gfortran-gcc
/home/A02079361/bin/treemix -i Lmelissa_inputfile.txt.gz -m 0 -root YBG,FCR -o melissa_tree0
The analysis gives four output files:
1.outstem.cov.gz. The covariance matrix between populations estimated from the data
2. outstem.covse.gz. The standard errors for each entry in the covariance matrix
3. outstem.modelcov.gz. The fitted covariance according to the model
4. outstem.treeout.gz. The fitted tree model and migration events
5. outstem.vertices.gz. This and the following file (outstem.edges.gz) contain the internal
structure of the inferred graph. Modifying these files will cause issues if you try to read the
graph back in, so we recommend against this.
6. outstem.edges.gz.
We visualize the tree from the *.treeout.gz file in R as follows:
source(~/treemix-1.12/src/plotting_funcs.R)
plot_tree("melissa_tree1")
To chose the tree we first got the likelihood estimate with the following script saved in modPve.pl
I gave the following command to run the script:
perl modPve.pl melissa_tree*vert* &
The script gave the following output files: pveTreemixMelissa.txt, Rplots.pdf(contains the residual plot), sourceout
To check which tree gives the best result we constructed a graph with No. of migrations(x-axis) versus pve(y-axis) saved as tm.pdf. We chose the tree with 2 migrations because the one with 0 migrations explains 95% of the variance. The next difference is with 1 migration event and then the next is 2 migration event. After 2 migrations the variation does not increase considerably.
SUMTREES
mountrc
go to schaturvedi/trees
head -n 1 ~/labs/evolution/projects/Lmelissa_host_adaptation/Lmelissa_Treemix/replicate99.treeout > test1
head -n 1 ~/labs/evolution/projects/Lmelissa_host_adaptation/Lmelissa_Treemix/replicate9.treeout > test2
sumtrees.py --decimals=0 --percentages --output=result.tre --target=test1 test2
sumtrees.py --decimals=0 --percentages --output=result.tre --target=best.tre treefile.tre
sumtrees.py --decimals=0 --percentages --rooted --output=result_rooted.tre --target=best.tre treefile.tre
The final tree used for further analysis is result_rooted.tre.
source("/uufs/chpc.utah.edu/common/home/u6007910/source/treemix-1.12/src/plotting_funcs.R")
plot_tree("out_treemix")