Post date: Jun 11, 2014 9:02:00 PM
Now that the structural annotation of the melissa genome is done, I am moving on to the functional annotation. Again, my plan is to mirror what Victor did for the T. cristinae genome. Here is his description from the OSM of the Science paper:
"Functional annotation was carried out using InterProScan 5.4.25.0 (49). We scanned T. cristinae predicted proteins against 11 signature databases: COILS 2.2, Gene3D 3.5.0, PANTHER 8.1, Pfam-A 27.0, PIRSF 2.84, PRINTS 42.0, ProDom 2006.1, PROSITE 20.97, SMART 6.2, SUPERFAMILY 1.3, and TIGRFAMs 13.0. The scan of the 44,292 protein sequences yielded a total of 147,560 matches distributed as follows: 3,808 hits for COILS, 22,294 for Gene3D, 30,096 for PANTHER, 25,474 for Pfam-A, 416 for PIRSF, 9,498 for PRINTS, 3 for ProDom, 20,084 for PROSITE, 13,643 for SMART, 21,474 for SUPERFAMILY, and 770 for TIGRFAMs. We found 23,083 predicted proteins that had at least one match with any of the databases. We limited further functional analyses to Pfam-A matches, because this database is characterized by high quality, manually curated entries. We extracted the Gene Ontology (GO) terms that mapped to the Pfam domains that matched the predicted protein."
Along these lines, I downloaded InterProScan-5.4-47.0 and installed it in the source directory on the dorc cluster. I then collected all of the MAKER protein sequences from matched genes (the scaffold_*.maker.proteins.fasta files, which now all reside in /labs/evolution/data/lycaeides/melissa_genome/Annotation/maker_proteins/) and used these as input for InterProScan. Here is an example for one scaffold:
/home/A01963476/Source/my_interproscan/interproscan-5.4-47.0/interproscan.sh -i /labs/evolution/data/lycaeides/melissa_genome/Annotation/maker_proteins/scaffold_998.maker.proteins.fasta -f xml,gff3 -goterms -iprlookup -d /home/A01963476/scratch/interproscan/
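The single-scaffold command above can be generalized to every scaffold file with a short loop. The following is only a sketch of one way to do that, not necessarily how the jobs were actually generated: the function name print_jobs is hypothetical, the paths and flags are copied from the example, and how the resulting command list gets submitted to the cluster is left open.

```shell
#!/bin/sh
# Sketch: print one InterProScan command per scaffold FASTA file.
# Paths and flags are copied from the single-scaffold example above;
# print_jobs is a hypothetical helper name.
IPRSCAN=/home/A01963476/Source/my_interproscan/interproscan-5.4-47.0/interproscan.sh
INDIR=/labs/evolution/data/lycaeides/melissa_genome/Annotation/maker_proteins
OUTDIR=/home/A01963476/scratch/interproscan/

# Print one command line for each scaffold_*.maker.proteins.fasta file in directory $1.
print_jobs() {
    for fasta in "$1"/scaffold_*.maker.proteins.fasta; do
        # If nothing matches, the pattern stays unexpanded; skip it.
        [ -e "$fasta" ] || continue
        printf '%s -i %s -f xml,gff3 -goterms -iprlookup -d %s\n' \
            "$IPRSCAN" "$fasta" "$OUTDIR"
    done
}

print_jobs "$INDIR"
```

Redirecting the output to a file gives a plain list of commands, one per line, that can be split up and handed to whatever job scheduler the cluster uses.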
The results will be written to /home/A01963476/scratch/interproscan/. Running every scaffold this way comes to over 4000 jobs!
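Once the jobs finish, a per-database tally like the one in Victor's description could be pulled from the GFF3 output. The sketch below rests on two assumptions that should be checked against a real output file: that each hit is a feature line whose third column is protein_match and whose second column names the signature database, and that the output files in the results directory end in .gff3. The function name count_hits is hypothetical.

```shell
#!/bin/sh
# Sketch: count hits per signature database across InterProScan GFF3 files.
# Assumes (unverified here) that hit lines have type "protein_match" in
# column 3 and the database name in column 2. count_hits is a hypothetical name.
OUTDIR=/home/A01963476/scratch/interproscan

count_hits() {
    awk -F '\t' '
        FNR == 1   { in_fasta = 0 }   # new file: reset the FASTA flag
        /^##FASTA/ { in_fasta = 1 }   # sequences after this marker are not features
        in_fasta || /^#/ { next }     # skip embedded FASTA and comment lines
        $3 == "protein_match" { n[$2]++ }
        END { for (db in n) print db "\t" n[db] }
    ' "$@" | sort
}

# Only run the tally if at least one GFF3 file exists.
if ls "$OUTDIR"/*.gff3 >/dev/null 2>&1; then
    count_hits "$OUTDIR"/*.gff3
fi
```

The output is one tab-separated line per database (name, then number of hits), which makes it easy to compare against the T. cristinae counts quoted above.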