Annotation Pipeline v.5.x

See our Change Log for the latest version information and updates.

Introduction

The main changes to the pipeline relate to the unification of the genome and metagenome annotation protocols, code changes that ensure the modularity, scalability and portability of the pipeline, and better compliance with GenBank submission requirements. As a result, the content of annotations generated by the old and new versions of the pipeline is largely the same. The most important changes to the genome and metagenome annotation content are summarized below; additional details can be provided upon request.

Links to previous version of pipeline:

If you have any questions please contact us.

Structural Annotation

The same annotation protocol for structural annotation is now applied to genome and metagenome datasets. The software tools and databases used include:

INFERNAL 1.1.2 search against Rfam 13.0 database (excluding tRNA and CRISPR models) to identify structural RNAs and regulatory motifs, such as riboswitches;
GeneMark.hmm-2 v1.05 and Prodigal V2.6.3 to identify protein-coding genes (CDSs);
tRNAscan-SE 2.0.4 to identify tRNAs;
CRT 1.8.2 to predict CRISPR arrays.

Compared to the old pipeline, the feature type (column 3 of gff files) of features identified by searching against Rfam is now fully compliant with INSDC and GenBank requirements. It includes the rRNA, tmRNA, ncRNA, misc_feature, misc_binding and regulatory types, with all attendant attributes. tRNA prediction has been upgraded to the latest version of tRNAscan-SE, which has higher accuracy than tRNAscan-SE 1.0.
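As a quick check of which feature types a gff file contains, column 3 can be tallied directly. The following is a minimal sketch, not part of the pipeline: the function name, the example records and the values in the source column are made up for illustration, and only the tab-delimited nine-column gff layout is assumed.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Tally column 3 (feature type) over gff lines, skipping comments and blanks."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # directive, comment or empty line
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed nine-column record
        counts[fields[2]] += 1
    return counts


# Hypothetical records; contig names, tool labels and IDs are illustrative only.
example_gff = [
    "##gff-version 3",
    "contig_1\tINFERNAL\trRNA\t10\t1500\t.\t+\t.\tID=contig_1_10_1500",
    "contig_1\ttRNAscan-SE\ttRNA\t1600\t1675\t.\t+\t.\tID=contig_1_1600_1675",
    "contig_1\tProdigal\tCDS\t1700\t2600\t.\t-\t0\tID=contig_1_1700_2600",
    "contig_2\tGeneMark.hmm-2\tCDS\t5\t904\t.\t+\t0\tID=contig_2_5_904",
]

counts = count_feature_types(example_gff)
for feature_type, n in sorted(counts.items()):
    print(feature_type, n)
```

For a real file, pass an open file handle instead of the list, for example `count_feature_types(open(path))`.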

Differences between genome and metagenome annotation

For genome annotation, tRNAscan-SE is run in both 'bacterial' and 'archaeal' mode for each dataset, and the mode generating more tRNA predictions with a known isotype is chosen on a per-contig basis (to accommodate SAGs and MAGs with possible contamination issues). For metagenome annotation, tRNAscan-SE is run in 'general' mode due to scalability issues. Based on our benchmarking, the 'general' mode of tRNAscan-SE 2.0.4 has lower sensitivity on archaeal contigs than the 'archaeal' mode, but still higher sensitivity than tRNAscan-SE 1.0.
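The per-contig choice between the two modes described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's actual code: the input is assumed to be already parsed into one dictionary per mode (contig name to list of predicted isotypes), the label "Undet" is assumed to mark predictions without a known isotype, and ties are resolved in favor of 'bacterial', which the text above does not specify.

```python
# Assumption: isotype labels that count as "unknown"; adjust to the parsed output.
UNKNOWN_ISOTYPES = {"Undet"}


def count_known(isotypes):
    """Number of predictions whose isotype is not in the unknown set."""
    return sum(1 for iso in isotypes if iso not in UNKNOWN_ISOTYPES)


def choose_mode_per_contig(bacterial, archaeal):
    """Pick, for each contig, the mode with more known-isotype tRNA predictions.

    bacterial, archaeal: dict mapping contig name -> list of isotype strings.
    Ties go to 'bacterial' (an arbitrary choice made for this sketch).
    """
    chosen = {}
    for contig in sorted(set(bacterial) | set(archaeal)):
        n_bac = count_known(bacterial.get(contig, []))
        n_arc = count_known(archaeal.get(contig, []))
        chosen[contig] = "archaeal" if n_arc > n_bac else "bacterial"
    return chosen


# Hypothetical predictions for two contigs of one dataset.
bacterial_run = {"contig_1": ["Ala", "Gly", "Undet"], "contig_2": ["Met"]}
archaeal_run = {"contig_1": ["Ala", "Gly"], "contig_2": ["Met", "Leu", "Undet"]}

modes = choose_mode_per_contig(bacterial_run, archaeal_run)
print(modes)
```

Choosing per contig rather than per dataset means a contaminating archaeal contig in a bacterial SAG or MAG can still receive the 'archaeal' predictions.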

Functional Annotation

The same functional annotation protocol is now applied to genome and metagenome datasets. Functional annotation includes assignment of protein-coding genes to the following 3D fold and functional protein families:

COG (version from 2003 with COG names and COG assignment to functional categories per 2014 update)
Pfam v34
TIGRFAM v15.0
Cath-Funfam v4.1.0
SuperFamily v1.75
SMART 01_06_2016
KEGG Orthology (KO) Terms v98.0
Enzyme Commission (EC) numbers derived from KO Term assignments

With the exception of KEGG Orthology, all assignments are made using hmmsearch from the HMMER 3.1b2 package, with the model-specific trusted cutoff for Pfam, the noise cutoff for TIGRFAM, and a --domE 0.01 cutoff for the remaining families. KEGG Orthology Terms are assigned in two steps: lastal 983 searches against KEGG Genes v77.1 assign KO Terms to IMG-NR genes, and these assignments are then used to assign KO Terms to the rest of the genes. The IMG-NR version used for annotation is reported in the dataset-specific 'sigs_annotation_parameters' file.
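The cutoff rules for hmmsearch can be expressed as a small command builder. This is a sketch only: the pipeline's actual command lines are not given here, and the function name, file names and the choice of --domtblout for output are assumptions. The flags --cut_tc (trusted cutoff), --cut_nc (noise cutoff), --domE and --domtblout are standard hmmsearch options.

```python
# Family-specific cutoffs as described in the text; all other families use --domE 0.01.
MODEL_CUTOFFS = {
    "Pfam": ["--cut_tc"],     # model-specific trusted cutoff
    "TIGRFAM": ["--cut_nc"],  # model-specific noise cutoff
}
DEFAULT_CUTOFF = ["--domE", "0.01"]


def hmmsearch_command(family, hmm_db, proteins_faa, domtblout):
    """Build an hmmsearch argument list for one protein family database.

    The paths are placeholders supplied by the caller; nothing is executed here.
    """
    cutoff = MODEL_CUTOFFS.get(family, DEFAULT_CUTOFF)
    return ["hmmsearch", "--domtblout", domtblout, *cutoff, hmm_db, proteins_faa]


# Hypothetical file names, for illustration only.
for family, db in [("Pfam", "Pfam-A.hmm"), ("TIGRFAM", "TIGRFAMs.hmm"), ("SMART", "smart.hmm")]:
    print(" ".join(hmmsearch_command(family, db, "proteins.faa", family + ".domtblout")))
```

The returned list can be passed to `subprocess.run` unchanged, which avoids shell quoting issues with file paths.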

Compared to the old genome annotation, the changes include:

assignment of COGs using hmmsearch instead of RPS-BLAST; the search uses a set of HMMs generated from the original COG multiple sequence alignments (v. 2003), resulting in higher sensitivity of assignments
assignment of 3D fold families (Cath-Funfam, SuperFamily and SMART) using all models rather than the subsets available in InterPro, with alignment parameters now available

As compared to old metagenome annotation, the changes include:

assignment of 3D fold families (Cath-Funfam, SuperFamily and SMART) with alignment parameters
assignment of TIGRFAM families

Differences between genome and metagenome annotation

None.

Topological annotation of protein coding genes (signal peptides and transmembrane regions)

No changes were made. SignalP 4.1 and TMHMM 2.0c are used.

Differences between genome and metagenome annotation

Only genome datasets are annotated.

Taxonomic annotation of protein coding genes ('phylogenetic distribution')

No changes were made. lastal 983 is used, with reference database (IMG-NR) version reported in 'sigs_annotation_parameters' files.

Differences between genome and metagenome annotation

None.

Possibility of reannotation of legacy datasets with pipeline v.5.0.0

Even though pipeline v.5.0.0 generates richer annotations, especially for metagenomes, the changes to the annotation content shared with the old pipeline (structural annotations, COGs, Pfams, KO terms, taxonomic assignments) are small or non-existent. We therefore anticipate that the switch to the new pipeline will NOT bias analysis workflows by causing datasets to cluster into artificial groups based on pipeline version rather than biological features. However, if you observe such effects, efforts will be made to bring all datasets in your study to the same annotation baseline. Such reannotation has to be arranged through your Project Manager to ensure proper tracking and prevent conflicts with the primary annotation queue.

Availability of the data generated by IMG annotation pipeline v.5.0.0

Gff3-format files provided as part of the IMG pipeline output (file _functional_annotation.gff) and as part of an IMG tarball (file *.gff) include all structural and functional annotations generated for each dataset. These gff files provide the entire set of predicted features and the majority of feature attributes, with the exception of tool-specific output and alignment parameters.

The IMG pipeline output (identified by GOLD Analysis Project id) also contains the raw output of the tools for each type of functional and structural annotation, gff files for each type of annotation, which include alignment coordinates, bit scores, etc., and tab-delimited files with the same information. In addition, tab-delimited files summarizing alignment parameters can be found in IMG tarballs.

All files from the IMG pipeline output and the IMG tarball are available for download from the JGI Data Portal in the 'IMG Data' directory of the respective sequencing project. For metagenomes, additional information about specific pipeline, software tool and database versions, as well as annotation summaries, can be found in the 'Metagenome Report Tables' directory. For genomes, a GenBank-format file will no longer be generated.

Full annotations are provided through the web graphical user interface (GUI) for genome datasets. The web GUI shows only TIGRFAM annotations for metagenome datasets. 3D fold annotations (Cath-Funfam, SuperFamily and SMART) for metagenomes will be shown in the future, after upgrades to the database schema and GUI.
