Functional Annotation

Updated on May 2, 2017

FolderDir:/labs/evolution/projects/Lmelissa_host_adaptation/funcAnnot

Important files:

bfmeansWithScafPosition_all, east, west- File contains bayesfactor means with scaffold and position from bayPass analysis. #206,028 SNPs
top58snps_scafPosBayesfactors : details for 58 shared SNPs between east and west populations. The file contains scaffold, positions and all, east and west bayesfactors for these SNPs from bayPass analysis.
top58snps : contains scaffold, position and all bayesfactors. Used for getting functional annotation data for just these 58 SNPs.
funcannotfile.txt - File contains Functional Annotation data for all 206,048 snps
bf_funcAnnot_combine.py - Python script to combine the above two files.
all_funcAnnot_bfmeans, west_funcAnnot_bfmeans, east_funcAnnot_bfmeans - Output files of the script above with combination of bfmeans and funcAnnotn data. NA is substituted for 19 SNPs for which the bayesfactors are missing. "top" file is with details for 58 shared SNPs NA is put for snps for which there are no bfmeans.
all_mainfile,west_mainfile,east_mainfile,top_OnlybfmeansandFuncAnnotdata.txt- This file has functional annotation data for the snps for which we have bayesfactor means (206,028 SNPs). All NA are removed using Step 0 in R below. Also, the bfmeans are dropped from these files as we do not need them for final annotation.
all_top0.1-0.0001, west_top0.1-0.0001, east_top0.1-0001 : these files contain informations for top 1%,0.01%,0.001% and 0.0001% bayesfactors SNPs. These are the same cut offs I used for parallel SNPs and structural annotation. These files will be annotated seperately.
Running perl script for getting the functional files:
03_table_SNPs_GO.pl - Perl script to generate the bioprocess, cellularprocess and molecular functions file
04_randomization_test.R: R script run from command line to get do a randomization test to Estimate a null frequency distribution for each GO term by sampling

As I could not install the dependencies for the perl script on the UofU cluster, I copied these files on my desktop and ran the analysis on my computer. These files are saved in Desktop/melissa_hostAdapt/funcAnnot. This folder has the main input files, perl script, randomization R script and the following output files for all, east, west and top 58 SNPs.

Output files for file with all the snps (background file for randomisation test):

8). OnlybfmeansandFuncAnnotdata.biopro.dsv

9). OnlybfmeansandFuncAnnotdata.celcom.dsv

10). OnlybfmeansandFuncAnnotdata.molfun.dsv

#Running R script for randomisation test on molecular function snps:

14). 04_randomization_test.R

15). mf/bp/cc_go.enrichment.txt - Output files after running the script above, containing all GO enrichment data

16). mf/bp/cc_go_enrichment_pvalues - contain pvalues from go enrichment from the randomization test

STEP 0 (done on the cluster)

#preparing files for top SNPs in top quantiles

#getting the top quantiles for the snps

all = read.table("all_funcAnnot_bfmeans", header=T)

east = read.table("east_funcAnnot_bfmeans", header=T)

west = read.table("west_funcAnnot_bfmeans", header=T)

#saving new files without NA and dropping the bfmeans column because we don't need that for final annotation

#all populations

x.all = data.frame(all)

sum(is.na(x.all$bfmeans)) #19

cc = is.na(x.all$bfmeans)

m = which(cc==c("TRUE"))

DF.all = x.all[-m,]

dim(DF.all) #206028 9

write.table(DF.all[,-9], file = "all_mainfile", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

#western populations

x.west = data.frame(west)

sum(is.na(x.west$bfmeans)) #19

cc = is.na(x.west$bfmeans)

m = which(cc==c("TRUE"))

DF.west = x.west[-m,]

dim(DF.west) #206028 9

write.table(DF.west[,-9], file = "west_mainfile", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

#eastern populations

x.east = data.frame(east)

sum(is.na(x.east$bfmeans)) #19

cc = is.na(x.east$bfmeans)

m = which(cc==c("TRUE"))

DF.east = x.east[-m,]

write.table(DF.east[,-9], file = "east_mainfile", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

#getting quantiles for all, western and eastern populations

westbf = DF.west$bfmeans

allbf = DF.all$bfmeans

eastbf = DF.east$bfmeans

bndsw = quantile(westbf, probs = c(0.99,0.999,0.9999,0.9))

bndse = quantile(eastbf, probs = c(0.99,0.999,0.9999,0.9))

bndsa = quantile(allbf, probs = c(0.99,0.999,0.9999,0.9))

#same numbers for east, all and west

sum(westbf > bndsw[1]) #2061

sum(westbf > bndsw[2]) #207

sum(westbf > bndsw[3]) #21

sum(westbf > bndsw[4]) #20603

#subset data to get files for only the top quantiles for all, east and west

all1 = subset(DF.all, DF.all$bfmeans > bndsa[1]) #top 0.01 %

all2 = subset(DF.all, DF.all$bfmeans > bndsa[2]) #top 0.001 %

all3 = subset(DF.all, DF.all$bfmeans > bndsa[3]) #top 0.0001 %

all4 = subset(DF.all, DF.all$bfmeans > bndsa[4]) #top 0.1 %

write.table(all1[,-9], file = "all_top0.01", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

write.table(all2[,-9], file = "all_top0.001", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

write.table(all3[,-9], file = "all_top0.0001", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

write.table(all4[,-9], file = "all_top0.1", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

west1 = subset(DF.west, DF.west$bfmeans > bndsw[1]) #top 0.01 %

west2 = subset(DF.west, DF.west$bfmeans > bndsw[2]) #top 0.001 %

west3 = subset(DF.west, DF.west$bfmeans > bndsw[3]) #top 0.0001 %

west4 = subset(DF.west, DF.west$bfmeans > bndsw[4]) #top 0.1 %

write.table(west1[,-9], file = "west_top0.01", quote=FALSE, sep=" ", row.names=FALSE, col.names= TRUE)

write.table(west2[,-9], file = "west_top0.001", quote=FALSE, sep=" ", row.names=FALSE, col.names= TRUE)

write.table(west3[,-9], file = "west_top0.0001", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

write.table(west4[,-9], file = "west_top0.1", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

east1 = subset(DF.east, DF.east$bfmeans > bndse[1]) #top 0.01 %

east2 = subset(DF.east, DF.east$bfmeans > bndse[2]) #top 0.001 %

east3 = subset(DF.east, DF.east$bfmeans > bndse[3]) #top 0.0001 %

east4 = subset(DF.east, DF.east$bfmeans > bndse[4]) #top 0.1 %

write.table(east1[,-9], file = "east_top0.01", quote=FALSE, sep=" ", row.names=FALSE, col.names= TRUE)

write.table(east2[,-9], file = "east_top0.001", quote=FALSE, sep=" ", row.names=FALSE, col.names= TRUE)

write.table(east3[,-9], file = "east_top0.0001", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

write.table(east4[,-9], file = "east_top0.1", quote=FALSE, sep=" ", row.names=FALSE, col.names=TRUE)

Doing functional enrichments:

STEP 1 (done on my desktop)

Used the script 03_table_SNPs_GO.pl

Summarise the GO terms present in the list of SNPs and output a table of GO terms sorted by number of counts

Syntax:

03_table_SNPs_GO.pl

-a <input file: SNPs functional annotations>

-hier consider GO hierarchy when counting GOs (optional)

-gom <local|remote> GO database mirror (default=local)

-bp output biological process GOs (optional)

-cc output cellular component GOs (optional)

-mf output molecular function GOs

-all output also parental GOs not on the list (optional)

Output:

1-3 files, depending on GO families (bp, cc, mf)

Each file contains:

6 commented lines:

Total number of unique GO terms

Total number of unique molecular function GO terms

Total number of GO terms

Total number of molecular function GO terms

Total number of SNPs with at least one GO term

Total number of SNPs with at least one molecular function GO term

followed by a table with 3 columns: GO, nSNPs, description

Arguments and file used/arguments given when I ran this script:

'a=s' => \$funcannot_file, # list of functional annotations = OnlybfmeansandFuncAnnotdata.txt

'hier' => \$hier, # consider hierarchy when counting GO terms

'gom=s' => \$gomirror, # GO database to use: local or remote (ebi) = -gom remote

'bp' => \$biopro, # output biological process terms = OnlybfmeansandFuncAnnotdata.biopro.dsv

'cc' => \$celcom, # output cellular component terms = OnlybfmeansandFuncAnnotdata.celcom.dsv

'mf' => \$molfun, # output molecular function terms = OnlybfmeansandFuncAnnotdata.molfun.dsv

'all' => \$all, # output counts even for parent GO terms not present on the list

#RUNNING FOR FILE WITH ALL SNPS

perl 03_table_SNPs_GO.pl -gom remote -a all_OnlybfmeansandFuncAnnotdata.txt -bp snpFuncAnnot_bio -cc snpFuncAnnot_cell -mf snpFuncAnnot_molFun

STEP 2:

Used the script 04_randomization_test.R

Estimate a null frequency distribution for each GO term by sampling

SNPs without replacement from the background list

Syntax:

04_randomization_test.R

There are 8 arguments that must be passed in order:

1.- working directory

2.- SNPs GO table - test

3.- SNPs GO table - background

4.- output

5.- minimum number of SNPs required for testing

6.- number of replicates

7.- number of cores

8.- type of GO term (bp, cc, mf, all)

Output:

9 columns:

1 GO

2 GO.text

3 nobs

4 totobs

5 nexp

6 totexp

7 pobs

8 pexp

9 pval

#RUNNING THE SCRIPT ON COMMAND LINE

R CMD BATCH ./04_randomization_test.R /labs/evolution/projects/Lmelissa_host_adaptation/funcAnnot/ TopbfSnpsFuncAnnotdata_500.molfun.dsv OnlybfmeansandFuncAnnotdata.molfun.dsv output_enrichment.txt 2 1000 8 mf

Page updated

Google Sites

Report abuse