Post date: Jun 12, 2014 10:43:12 PM
The 13,158 predicted protein sequences from MAKER are in /labs/evolution/data/lycaeides/melissa_genome/Annotation/maker_proteins (one FASTA file per scaffold; note that the number of sequences matches the number of genic regions). The results from running InterProScan (v 5.4-47) are in /labs/evolution/data/lycaeides/melissa_genome/Annotation/functional_annotations/. Here is an example of the command (repeated for every scaffold):
/home/A01963476/Source/my_interproscan/interproscan-5.4-47.0/interproscan.sh -i /labs/evolution/data/lycaeides/melissa_genome/Annotation/maker_proteins/scaffold_9.maker.proteins.fasta -f xml,gff3 -goterms -iprlookup -d /home/A01963476/scratch/interproscan/
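Repeating the command over all scaffolds could be scripted roughly as follows. This is only a sketch, not the loop that was actually used: the function names build_command and run_all are hypothetical, and it assumes every input file follows the scaffold_N.maker.proteins.fasta naming pattern seen in the example. The paths and flags are copied from the command above.

```python
import subprocess
from pathlib import Path

# Paths and flags copied from the example command above.
INTERPROSCAN = "/home/A01963476/Source/my_interproscan/interproscan-5.4-47.0/interproscan.sh"
PROTEIN_DIR = "/labs/evolution/data/lycaeides/melissa_genome/Annotation/maker_proteins"
OUT_DIR = "/home/A01963476/scratch/interproscan/"


def build_command(fasta):
    """Return the InterProScan argument list for one scaffold's protein FASTA."""
    return [INTERPROSCAN, "-i", str(fasta), "-f", "xml,gff3",
            "-goterms", "-iprlookup", "-d", OUT_DIR]


def run_all(protein_dir=PROTEIN_DIR, dry_run=True):
    """Run the command for every scaffold file (only print it if dry_run)."""
    # Assumption: inputs are named scaffold_N.maker.proteins.fasta.
    fastas = sorted(Path(protein_dir).glob("scaffold_*.maker.proteins.fasta"))
    for fasta in fastas:
        cmd = build_command(fasta)
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop at the first failed scaffold.
            subprocess.run(cmd, check=True)
    return len(fastas)
```

Calling run_all() prints the commands without running anything; run_all(dry_run=False) would execute them one scaffold at a time.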
The results include the following numbers of matches per database (counted with UNIX utilities, e.g., grep and cut):
3115 Coils
13464 Gene3D
109 Hamap
14713 Pfam
239 PIRSF
6448 PRINTS
1 ProDom
4464 ProSitePatterns
9029 ProSiteProfiles
10006 SMART
12501 SUPERFAMILY
372 TIGRFAM
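For reference, a per-database tally like the one above could also be made from the GFF3 output with a few lines of Python instead of grep and cut. This is a sketch under assumptions, not the commands actually used: it assumes the database name is in column 2 and that match lines have the type protein_match in column 3, and it counts one match per line, which may or may not be how the numbers above were counted. The function name count_matches_by_database is hypothetical.

```python
from collections import Counter


def count_matches_by_database(gff3_lines):
    """Count match lines per source database in InterProScan-style GFF3 text."""
    counts = Counter()
    for line in gff3_lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more feature lines
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF3 feature line
        source, feature_type = fields[1], fields[2]
        # Assumption: only protein_match lines are database hits.
        if feature_type == "protein_match":
            counts[source] += 1
    return counts
```

It takes any iterable of lines, so an open file handle for one of the .gff3 files can be passed in directly, and the Counter objects from different scaffolds can be added together.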
I extracted the GO terms for the SNVs in this project using a slightly modified version of a script from Victor, retrieve_GO_SNPs.pl. The script and the primary results from the enrichment analyses are in /labs/evolution/projects/lycaeides_hostplant/funcAnnot. These files contain a summary of the GO terms, based on Pfam-A (one of the databases), for the gene nearest each SNV (if there is no gene on the same scaffold, no annotation is given). The SNVs were provided in the file SnpList.pl.
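The nearest-gene lookup can be illustrated with a short Python sketch. This is not retrieve_GO_SNPs.pl and may differ from it in detail: the function name nearest_gene and the (gene_id, scaffold, start, end) tuple layout are hypothetical, and distance is assumed to be zero inside a gene and otherwise the distance to the closer gene boundary.

```python
def nearest_gene(snv_scaffold, snv_pos, genes):
    """Return the ID of the gene nearest an SNV on the same scaffold, or None.

    genes is an iterable of (gene_id, scaffold, start, end) tuples
    (a hypothetical layout chosen for this illustration).
    """
    best_id, best_dist = None, None
    for gene_id, scaffold, start, end in genes:
        if scaffold != snv_scaffold:
            continue  # only genes on the SNV's own scaffold are considered
        if start <= snv_pos <= end:
            dist = 0  # SNV lies inside the gene
        else:
            dist = min(abs(snv_pos - start), abs(snv_pos - end))
        if best_dist is None or dist < best_dist:
            best_id, best_dist = gene_id, dist
    return best_id  # None means no gene on that scaffold, so no annotation
```

The GO terms for an SNV would then be whatever terms are attached to the returned gene; a return value of None corresponds to the "no annotation" case.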
I then grabbed the SNVs with the top 0.1% model-averaged effect sizes for survival and weight for each treatment. The scaffolds and positions of these are in the out_* files. I extracted the GO terms for these SNVs and summarized them (see the summary* files). These give the number of trait-associated SNVs with different annotations (each locus can have multiple annotations). I then used R to calculate binomial probabilities of enrichment, with expectations based on all SNVs. I focused on molecular function, as these terms had the most annotations and seemed most interesting. I tested for enrichment of functions that had at least three hits and saved information on those that were significant in two or more analyses. See the details below, and note that metal-ion binding shows up, just like in Timema!
metal-ion binding: survival in all (obs = 5, enrichment = 7.94, p = 4.323e-5), ms (obs = 5, enrichment = 7.85, p = 4.634e-5), sla (obs = 5, enrichment = 8.16, p = 3.752e-5), sla x ms (obs = 5, enrichment = 8.06, p = 4.0297e-5)
calcium-ion binding: gls x ms survival (obs = 3, enrichment = 2.40, p = 0.0373), weight on ms (obs = 9, enrichment = 7.09, p = 6.044e-7); similarly, calcium-dependent phospholipid binding: gla x ms survival (obs = 3, enrichment = 150.0, p = 5.556e-9)
ATP: survival in all (obs = 3, enrichment = 1.90, p = 0.0215), ac (obs = 5, enrichment = 3.15, p = 0.0052)
protein binding: weight in all (obs = 9, enrichment = 2.01, p = 0.0135), ac (obs = 9, enrichment = 2.01, p = 0.0135), sla (obs = 9, enrichment = 2.04, p = 0.0123)
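For anyone wanting to redo this kind of test, the binomial enrichment calculation can be sketched as below. The analysis itself was done in R and the exact call is not recorded here, so this is a Python illustration under assumptions: the expected proportion is taken as the fraction of all annotated SNVs carrying the term, enrichment is observed over expected, and the p-value is the upper tail P(X >= obs), which in R would be pbinom(obs - 1, n, p, lower.tail = FALSE). The function and argument names are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_enrichment(obs, n_top, n_term_all, n_all):
    """Return (enrichment, p_value) for one GO term.

    obs        -- top SNVs annotated with the term
    n_top      -- number of top SNVs with any annotation
    n_term_all -- SNVs annotated with the term among all SNVs
    n_all      -- all annotated SNVs
    Assumes 0 < n_term_all < n_all.
    """
    p0 = n_term_all / n_all            # expected proportion from all SNVs
    enrichment = obs / (n_top * p0)    # observed / expected count

    def log_pmf(k):
        # log of the Binomial(n_top, p0) probability of exactly k hits
        log_choose = lgamma(n_top + 1) - lgamma(k + 1) - lgamma(n_top - k + 1)
        return log_choose + k * log(p0) + (n_top - k) * log1p(-p0)

    # Upper tail: probability of obs or more hits by chance.
    p_value = sum(exp(log_pmf(k)) for k in range(obs, n_top + 1))
    return enrichment, min(p_value, 1.0)
```

Working on the log scale avoids overflow in the binomial coefficient when the number of top SNVs is large. The counts needed as inputs would come from the summary* files and the all-SNV annotation summary.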