Updated on May 2, 2017
FolderDir:/labs/evolution/projects/Lmelissa_host_adaptation/funcAnnot
Important files:
bfmeansWithScafPosition_all, east, west- File contains bayesfactor means with scaffold and position from bayPass analysis. #206,028 SNPs
top58snps_scafPosBayesfactors : details for 58 shared SNPs between east and west populations. The file contains scaffold, positions and all, east and west bayesfactors for these SNPs from bayPass analysis.
top58snps : contains scaffold, position and all bayesfactors. Used for getting functional annotation data for just these 58 SNPs.
funcannotfile.txt - File contains Functional Annotation data for all 206,048 snps
bf_funcAnnot_combine.py - Python script to combine the above two files.
all_funcAnnot_bfmeans, west_funcAnnot_bfmeans, east_funcAnnot_bfmeans - Output files of the script above with combination of bfmeans and funcAnnotn data. NA is substituted for 19 SNPs for which the bayesfactors are missing. "top" file is with details for 58 shared SNPs NA is put for snps for which there are no bfmeans.
all_mainfile,west_mainfile,east_mainfile,top_OnlybfmeansandFuncAnnotdata.txt- This file has functional annotation data for the snps for which we have bayesfactor means (206,028 SNPs). All NA are removed using Step 0 in R below. Also, the bfmeans are dropped from these files as we do not need them for final annotation.
all_top0.1-0.0001, west_top0.1-0.0001, east_top0.1-0001 : these files contain informations for top 1%,0.01%,0.001% and 0.0001% bayesfactors SNPs. These are the same cut offs I used for parallel SNPs and structural annotation. These files will be annotated seperately.
Running perl script for getting the functional files:
03_table_SNPs_GO.pl - Perl script to generate the bioprocess, cellularprocess and molecular functions file
04_randomization_test.R: R script run from command line to get do a randomization test to Estimate a null frequency distribution for each GO term by sampling
As I could not install the dependencies for the perl script on the UofU cluster, I copied these files on my desktop and ran the analysis on my computer. These files are saved in Desktop/melissa_hostAdapt/funcAnnot. This folder has the main input files, perl script, randomization R script and the following output files for all, east, west and top 58 SNPs.
Output files for file with all the snps (background file for randomisation test):
8). OnlybfmeansandFuncAnnotdata.biopro.dsv
9). OnlybfmeansandFuncAnnotdata.celcom.dsv
10). OnlybfmeansandFuncAnnotdata.molfun.dsv
#Running R script for randomisation test on molecular function snps:
14). 04_randomization_test.R
15). mf/bp/cc_go.enrichment.txt - Output files after running the script above, containing all GO enrichment data
16). mf/bp/cc_go_enrichment_pvalues - contain pvalues from go enrichment from the randomization test
STEP 0 (done on the cluster)
#preparing files for top SNPs in top quantiles
#getting the top quantiles for the snps
all = read.table("all_funcAnnot_bfmeans", header=T)
east = read.table("east_funcAnnot_bfmeans", header=T)
west = read.table("west_funcAnnot_bfmeans", header=T)
#saving new files without NA and dropping the bfmeans column because we don't need that for final annotation
#all populations
x.all = data.frame(all)
sum(is.na(x.all$bfmeans)) #19
cc = is.na(x.all$bfmeans)
m = which(cc==c("TRUE"))
DF.all = x.all[-m,]
dim(DF.all) #206028 9
write.table(DF.all[,-9], file = "all_mainfile", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
#western populations
x.west = data.frame(west)
sum(is.na(x.west$bfmeans)) #19
cc = is.na(x.west$bfmeans)
m = which(cc==c("TRUE"))
DF.west = x.west[-m,]
dim(DF.west) #206028 9
write.table(DF.west[,-9], file = "west_mainfile", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
#eastern populations
x.east = data.frame(east)
sum(is.na(x.east$bfmeans)) #19
cc = is.na(x.east$bfmeans)
m = which(cc==c("TRUE"))
DF.east = x.east[-m,]
write.table(DF.east[,-9], file = "east_mainfile", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
#getting quantiles for all, western and eastern populations
westbf = DF.west$bfmeans
allbf = DF.all$bfmeans
eastbf = DF.east$bfmeans
bndsw = quantile(westbf, probs = c(0.99,0.999,0.9999,0.9))
bndse = quantile(eastbf, probs = c(0.99,0.999,0.9999,0.9))
bndsa = quantile(allbf, probs = c(0.99,0.999,0.9999,0.9))
#same numbers for east, all and west
sum(westbf > bndsw[1]) #2061
sum(westbf > bndsw[2]) #207
sum(westbf > bndsw[3]) #21
sum(westbf > bndsw[4]) #20603
#subset data to get files for only the top quantiles for all, east and west
all1 = subset(DF.all, DF.all$bfmeans > bndsa[1]) #top 0.01 %
all2 = subset(DF.all, DF.all$bfmeans > bndsa[2]) #top 0.001 %
all3 = subset(DF.all, DF.all$bfmeans > bndsa[3]) #top 0.0001 %
all4 = subset(DF.all, DF.all$bfmeans > bndsa[4]) #top 0.1 %
write.table(all1[,-9], file = "all_top0.01", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
write.table(all2[,-9], file = "all_top0.001", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
write.table(all3[,-9], file = "all_top0.0001", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
write.table(all4[,-9], file = "all_top0.1", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
west1 = subset(DF.west, DF.west$bfmeans > bndsw[1]) #top 0.01 %
west2 = subset(DF.west, DF.west$bfmeans > bndsw[2]) #top 0.001 %
west3 = subset(DF.west, DF.west$bfmeans > bndsw[3]) #top 0.0001 %
west4 = subset(DF.west, DF.west$bfmeans > bndsw[4]) #top 0.1 %
write.table(west1[,-9], file = "west_top0.01", quote=FALSE, sep=" ", row.names=FALSE, col.names= TRUE)
write.table(west2[,-9], file = "west_top0.001", quote=FALSE, sep=" ", row.names=FALSE, col.names= TRUE)
write.table(west3[,-9], file = "west_top0.0001", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
write.table(west4[,-9], file = "west_top0.1", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
east1 = subset(DF.east, DF.east$bfmeans > bndse[1]) #top 0.01 %
east2 = subset(DF.east, DF.east$bfmeans > bndse[2]) #top 0.001 %
east3 = subset(DF.east, DF.east$bfmeans > bndse[3]) #top 0.0001 %
east4 = subset(DF.east, DF.east$bfmeans > bndse[4]) #top 0.1 %
write.table(east1[,-9], file = "east_top0.01", quote=FALSE, sep=" ", row.names=FALSE, col.names= TRUE)
write.table(east2[,-9], file = "east_top0.001", quote=FALSE, sep=" ", row.names=FALSE, col.names= TRUE)
write.table(east3[,-9], file = "east_top0.0001", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
write.table(east4[,-9], file = "east_top0.1", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)
Doing functional enrichments:
STEP 1 (done on my desktop)
Used the script 03_table_SNPs_GO.pl
Summarise the GO terms present in the list of SNPs and output a table of GO terms sorted by number of counts
Syntax:
03_table_SNPs_GO.pl
-a <input file: SNPs functional annotations>
-hier consider GO hierarchy when counting GOs (optional)
-gom <local|remote> GO database mirror (default=local)
-bp output biological process GOs (optional)
-cc output cellular component GOs (optional)
-mf output molecular function GOs
-all output also parental GOs not on the list (optional)
Output:
1-3 files, depending on GO families (bp, cc, mf)
Each file contains:
6 commented lines:
Total number of unique GO terms
Total number of unique molecular function GO terms
Total number of GO terms
Total number of molecular function GO terms
Total number of SNPs with at least one GO term
Total number of SNPs with at least one molecular function GO term
followed by a table with 3 columns: GO, nSNPs, description
Arguments and file used/arguments given when I ran this script:
'a=s' => \$funcannot_file, # list of functional annotations = OnlybfmeansandFuncAnnotdata.txt
'hier' => \$hier, # consider hierarchy when counting GO terms
'gom=s' => \$gomirror, # GO database to use: local or remote (ebi) = -gom remote
'bp' => \$biopro, # output biological process terms = OnlybfmeansandFuncAnnotdata.biopro.dsv
'cc' => \$celcom, # output cellular component terms = OnlybfmeansandFuncAnnotdata.celcom.dsv
'mf' => \$molfun, # output molecular function terms = OnlybfmeansandFuncAnnotdata.molfun.dsv
'all' => \$all, # output counts even for parent GO terms not present on the list
#RUNNING FOR FILE WITH ALL SNPS
perl 03_table_SNPs_GO.pl -gom remote -a all_OnlybfmeansandFuncAnnotdata.txt -bp snpFuncAnnot_bio -cc snpFuncAnnot_cell -mf snpFuncAnnot_molFun
STEP 2:
Used the script 04_randomization_test.R
Estimate a null frequency distribution for each GO term by sampling
SNPs without replacement from the background list
Syntax:
04_randomization_test.R
There are 8 arguments that must be passed in order:
1.- working directory
2.- SNPs GO table - test
3.- SNPs GO table - background
4.- output
5.- minimum number of SNPs required for testing
6.- number of replicates
7.- number of cores
8.- type of GO term (bp, cc, mf, all)
Output:
9 columns:
1 GO
2 GO.text
3 nobs
4 totobs
5 nexp
6 totexp
7 pobs
8 pexp
9 pval
#RUNNING THE SCRIPT ON COMMAND LINE
R CMD BATCH ./04_randomization_test.R /labs/evolution/projects/Lmelissa_host_adaptation/funcAnnot/ TopbfSnpsFuncAnnotdata_500.molfun.dsv OnlybfmeansandFuncAnnotdata.molfun.dsv output_enrichment.txt 2 1000 8 mf