Post date: Mar 06, 2014 11:20:5 PM
I intend to use MAKER to annotate the current L. melissa draft genome (here is a tutorial). This will provide an indication of the quality of the current genome and could provide putative functional information for the genetic basis of larval performance in the experiments. My plan for genome annotation follows Victor's work on the Timema genome.
MAKER uses ab initio gene predictions and expression or protein sequence data for genome annotation. The ab initio gene predictions are used for initial annotations. I will use one of the two programs Victor used for this, SNAP (Semi-HMM-based Nucleic Acid Parser). SNAP implements an HMM that requires a training gene model.
Similar to Victor, I am using CEGMA (Core Eukaryotic Genes Mapping Approach) for this (here is the paper). Here is Victor's description of how it works and the results from Timema:
CEGMA is a pipeline to find orthologs of a set of 458 highly conserved (~universal) eukaryotic proteins and annotate them (i.e. determine exons and introns). It uses BLAST (tblastn) to identify candidate genes, and Genewise, HMMER, and geneid to refine gene structures. There is a subset of 248 ultraconserved genes that are used to calculate completeness statistics: It found 135 complete matches (54.4%), and 214 partial matches (86.29%) in the Timema cristinae genome. This is rather low in comparison to most other genomes, which show values in the range of 90-100%, but it is not the worst case anyway. I think the difference between complete and partial matches might be explained because we have a rather fragmented draft (i.e. many scaffolds) and it would be improved whether we would merge scaffolds in linkage groups.
I installed CEGMA and the other needed programs (this was a pain) and I am running cegma on my computer. The results will be in data/lycaeides/melissa_genome/Annotation/cegma/. Here is the command I ran (from Downloads/cegma_v2.4.010312/sample/):
cegma --genome ~/labs/evolution/data/lycaeides/melissa_genome/final.assembly.fasta
We had complete matches to 49.19% of the 248 CEGs and partial matches to 68.55% of them. This is a bit worse than for Timema, but not surprising as the Timema genome is in better shape.
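As a quick sanity check on these completeness numbers, the percentages can be converted to CEG counts and back. This is only a sketch: it assumes the percentages are taken over the 248 ultraconserved CEGs, and the helper names ceg_count and ceg_percent are made up for illustration (they are not part of CEGMA).

```shell
# Number of ultraconserved CEGs used for the completeness statistics
# (assumption: all percentages above are relative to this set).
N_CEGS=248

# ceg_count PERCENT: print the CEG count (rounded to the nearest integer)
# that corresponds to PERCENT of N_CEGS. Hypothetical helper.
ceg_count() {
    LC_ALL=C awk -v pct="$1" -v n="$N_CEGS" 'BEGIN { printf "%d\n", pct * n / 100 + 0.5 }'
}

# ceg_percent COUNT: print COUNT as a percentage of N_CEGS (two decimals).
# Hypothetical helper.
ceg_percent() {
    LC_ALL=C awk -v k="$1" -v n="$N_CEGS" 'BEGIN { printf "%.2f\n", 100 * k / n }'
}

ceg_count 49.19    # L. melissa, complete matches
ceg_count 68.55    # L. melissa, partial matches
ceg_percent 135    # Timema, complete matches
ceg_percent 214    # Timema, partial matches
```

Under that assumption, the L. melissa percentages correspond to roughly 122 complete and 170 partial matches, compared with 135 and 214 for Timema.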
I am also gathering protein and EST data. I have EST data from D. plexippus, H. erato and H. melpomene in data/lycaeides/melissa_genome/Annotation/est. The monarch data is from brain tissue, whereas the Heliconius data is from wing discs. I might consider combining these or just using the monarch data. The *est file has all of the metadata for the monarch ESTs. I also downloaded the UniRef50 data set (downloaded 6 March 2014), which is in data/lycaeides/melissa_genome/Annotation/protein.
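If I do combine the EST sets, something like the following would do it. This is a rough sketch, not what I have actually run: combine_fasta is a made-up helper, and the file names in the commented usage line are placeholders, since the real EST file names are not listed here.

```shell
# combine_fasta OUT IN...: concatenate the FASTA files IN... into OUT and
# print the number of sequence records (header lines starting with ">").
# Hypothetical helper. "awk 1" is used instead of cat so that an input file
# without a trailing newline does not get its last sequence line glued to
# the next file's header.
combine_fasta() {
    out="$1"
    shift
    awk 1 "$@" > "$out"
    grep -c '^>' "$out"
}

# Example usage (placeholder file names, assumed to be FASTA):
# combine_fasta combined_est.fasta monarch_est.fasta erato_est.fasta melpomene_est.fasta
```

The printed record count should equal the sum of the record counts of the input files, which is an easy way to confirm nothing was lost in the merge.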