IMG Annotation Version Change Log

IMG Annotation Version 2024a (12/08/2024)

This annotation version includes updates to Pfam v37.0, KEGG v111.1, and IMG-NR v20240916, and adds the geNomad database v1.7. The geNomad database is required by the newly added geNomad module in the IMG annotation pipeline, which identifies mobile genetic elements. The addition of this module prompted a minor version upgrade of the pipeline.

All isolate genomes have been re-annotated using the updated Pfam and KEGG versions.

Reference Database Version Information

Cath-Funfam v4.2.0
COG v2014
geNomad v1.7
IMG-NR v20240916
KEGG v111.1
Pfam v37.0
SMART v01_06_2016
SuperFamily v1.75
TIGRFAMs v15.0

IMG Annotation Pipeline Version and Change Log Information

IMGAP v5.3.x

Change Log:

[5.3.0] - 2024-12-11

Added

Added genomad/genomad.sh to run the geNomad tool. It runs the container built from the public IMG geNomad container.
Added new container_run_command option in the config yaml. This will be used to execute containers (currently only used in the new genomad module).

Changed

Modified the yaml parser to properly handle values that contain whitespace (required for the new container_run_command option in the config yaml).
The lastal call now specifically sets the -m argument. Refactored the options/arguments of its wrapper script to accommodate this change.
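
The whitespace handling mentioned for the yaml parser can be illustrated with a minimal sketch. It parses flat key: value lines and splits only at the first colon, so a value such as a container command keeps its internal spaces. The function name parse_flat_yaml and the example command value are hypothetical; the pipeline's actual parser is not shown in this change log.

```python
def parse_flat_yaml(text):
    """Parse flat 'key: value' lines; values may contain whitespace."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so the rest of the line stays intact.
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip()
    return config


# Hypothetical config content; the real option value depends on the local setup.
example = "container_run_command: docker run --rm\nthreads: 8\n"
print(parse_flat_yaml(example))
```

Splitting on all whitespace instead would truncate the command to its first word, which is the kind of problem the parser change addresses.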

IMG Annotation Version 2023a (12/06/2023)

This annotation version upgrade includes a new IMG-NR and a minor version update to the IMG Annotation Pipeline, driven by the addition of a step for assigning phylogeny to contigs.

Reference Database Version Information

Cath-Funfam v4.2.0
COG v2014
IMG-NR v20230629
KEGG v98.0
Pfam v34.0
SMART v01_06_2016
SuperFamily v1.75
TIGRFAMs v15.0

IMG Annotation Pipeline Version and Change Log Information

IMGAP v5.2.x

Change Log:

[5.2.2] - 2024-12-11

Added

Added TMHMM version to decodeanhmm version.
Each module outputs its tool's version info.

Changed

The gff_and_final_fasta_stats.py output is now routed to its own log file.

Fixed

Fixed bug that caused the HMMER version to be empty in the TIGRFAM GFF file.

Removed

Removed last newline from /usr/bin/time format.
Removed tee usage so that outputs are routed only to their respective log files.
Removed thread throttling from the tRNA module.

[5.2.1] - 2023-12-06

Changed

The TMHMM model file is now detected automatically.

Removed

TMHMM model entry in annotation config template.

[5.2.0] - 2023-12-06

Added

Added create_scaffold_lineage.py which creates a tsv file containing the consensus lineage for each scaffold based on the gene phylogeny tsv file.
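
The change log does not describe how create_scaffold_lineage.py derives the consensus, so the following is only one plausible approach: a per-rank majority vote over the lineages of the genes on each scaffold. The function name consensus_lineages, the (scaffold_id, lineage) input rows, the semicolon-separated lineage format, and the min_fraction threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def consensus_lineages(gene_rows, min_fraction=0.5):
    """Return {scaffold_id: consensus lineage} using a per-rank majority vote."""
    by_scaffold = defaultdict(list)
    for scaffold_id, lineage in gene_rows:
        by_scaffold[scaffold_id].append(lineage.split(";"))

    result = {}
    for scaffold_id, lineages in by_scaffold.items():
        consensus = []
        candidates = lineages
        while candidates:
            depth = len(consensus)
            # Count the taxon at the current rank among lineages that still agree.
            counts = Counter(lin[depth] for lin in candidates if len(lin) > depth)
            if not counts:
                break
            taxon, votes = counts.most_common(1)[0]
            # Stop when no taxon is supported by more than min_fraction of all genes.
            if votes / len(lineages) <= min_fraction:
                break
            consensus.append(taxon)
            candidates = [
                lin for lin in candidates if len(lin) > depth and lin[depth] == taxon
            ]
        result[scaffold_id] = ";".join(consensus)
    return result


rows = [
    ("scaffold_1", "Bacteria;Proteobacteria;Gammaproteobacteria"),
    ("scaffold_1", "Bacteria;Proteobacteria;Alphaproteobacteria"),
    ("scaffold_1", "Bacteria;Firmicutes;Bacilli"),
]
print(consensus_lineages(rows))
```

In this example two of three genes agree down to the phylum, so the sketch reports Bacteria;Proteobacteria and stops at the class rank, where no taxon has a majority.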

Changed

Renamed lastal_img_nr_ko_ec_gene_phylo.sh to lastal_img_nr_ko_ec_gene_phylo_scaffold_lineage.sh.
The img_nr_ko_ec_gene_phylo.log got renamed to img_nr_ko_ec_gene_phylo_scaffold_lineage.log.
Changed split and replace characters in lastal_img_nr_ko_ec_gene_phylo_hit_selector.py to accommodate slight format changes made to the new IMG NR MD5 lookup file.

IMG Annotation Version 2021a (12/02/2021)

This annotation version includes updates to Pfam v34.0, KEGG v98.0, and a new IMG-NR. The IMG annotation pipeline received a minor version upgrade due to a complete rewrite of the CDS prediction module, including a new overlap resolution approach. Additionally, the pipeline now considers the top 5 bitscores instead of the top 5 hits.

All isolate genomes have been re-annotated using the updated Pfam and KEGG versions.

Reference Database Version Information

Cath-Funfam v4.2.0
COG v2014
IMG-NR v20211118
KEGG v98.0
Pfam v34.0
SMART v01_06_2016
SuperFamily v1.75
TIGRFAMs v15.0

IMG Annotation Pipeline Version and Change Log Information

IMGAP v5.1.x

Change Log:

[5.1.17] - 2023-06-01

Changed

The cmsearch and hmmsearch scripts will now create an empty output GFF file if no features got predicted.

[5.1.16] - 2023-05-30

Changed

Changed structural_annotation.sh so GFF files will only get added to gff_files_merger.py if they exist and are not empty.

Fixed

Fixed typo in gff_files_merger.py argument list.

[5.1.15] - 2023-05-18

Changed

The pipeline now states explicitly when Rfam did not predict any features.

[5.1.14] - 2023-04-03

Added

Added set and pipefail commands to bash scripts.

[5.1.13] - 2022-12-21

Changed

Made the contig name parsing more robust in pick_best_genemark_predictions.sh for non-JGI setups.

[5.1.12] - 2022-12-08

Added

Added finalize_fasta_files.py which filters all genes and proteins that are not in the final merged structural annotation GFF out of the CDS genes and proteins fasta files to create their final versions.

Changed

The final fasta files produced by the CDS prediction module now contain a 'cds_' in the filename and are the new basis (instead of the fasta files created by the GeneMark and Prodigal modules) for the final genes and proteins fasta files creation (via the newly added finalize_fasta_files.py).

[5.1.11] - 2022-10-03

Fixed

Fixed a bug that caused the pipeline to keep running despite all sequences being filtered out during pre-QC.

[5.1.10] - 2022-09-28

Fixed

Fixed a bug that caused duplicate start_type attributes in the structural/functional annotation GFFs.

[5.1.9] - 2022-08-23

Fixed

Fixed a bug that in specific cases could cause an IndexError in the Contig.create_final_trusted_genes method of the cds_overlap_resolver.py script.

[5.1.8] - 2022-06-16

Changed

In the pre-QC step, carriage returns are now replaced with newline characters instead of being removed. Runs of consecutive newlines are then collapsed into a single newline.
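
A minimal sketch of this normalization, assuming the FASTA content is available as a single string (the function name normalize_line_endings is hypothetical and not taken from the pipeline):

```python
import re


def normalize_line_endings(text):
    """Replace carriage returns with newlines, then collapse newline runs."""
    text = text.replace("\r", "\n")
    return re.sub(r"\n+", "\n", text)


# Mixed Windows (\r\n) and old Mac (\r) line endings in one input.
print(repr(normalize_line_endings(">c1\r\nACGT\r\rTTGA\n")))
```

Replacing rather than removing matters for files that use a bare carriage return as the line terminator: removing it would join the header and the sequence into one line.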

[5.1.7] - 2022-06-01

Added

The config file now contains a translation_table entry that can get used to force the usage of a specific genetic code for the CDS prediction. The value 0 will mimic the previous behavior in which the pipeline tries to auto-predict the correct translation table.

[5.1.6] - 2022-05-26

Changed

Removed the files created by GeneMark's translation table runs from the set of permanent output files.

[5.1.5] - 2022-03-12

Changed

Fasta files not starting with a '>' character fail the pre-QC step.
If an empty line in the fasta file is followed by a line not starting with '>' the pre-QC step fails.
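
The two rules above can be sketched as a small check. The function name fasta_layout_problems and the message texts are hypothetical, and this sketch tolerates consecutive empty lines, which is an assumption rather than documented pipeline behavior.

```python
def fasta_layout_problems(text):
    """Return a list of layout problems found in FASTA text (empty if none)."""
    problems = []
    # Rule 1: the file has to start with a header character.
    if not text.startswith(">"):
        problems.append("file does not start with '>'")
    lines = text.split("\n")
    # Rule 2: a non-empty line right after an empty line has to be a header.
    for index in range(len(lines) - 1):
        current, following = lines[index], lines[index + 1]
        if current == "" and following != "" and not following.startswith(">"):
            problems.append(
                f"line {index + 2} follows an empty line but is not a header"
            )
    return problems


print(fasta_layout_problems(">c1\nACGT\n\nTTGA\n"))
```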

[5.1.4] - 2022-02-25

Changed

Ensured that CRISPR and CDS features that are allowed to overlap do not start at the same position.

[5.1.3] - 2022-02-12

Changed

The Rfam clan filter script now also removes the lower scoring of two overlapping hits to the same model, even if they are not in a clan.
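
The same-model part of this rule can be sketched as a greedy filter that visits hits from highest to lowest score. The function name, the dictionary keys, and the decision to ignore strand are assumptions made for illustration; the clan-based filtering itself is not reproduced here.

```python
def drop_lower_scoring_same_model_overlaps(hits):
    """Keep only the higher scoring hit when two hits to one model overlap.

    Each hit is a dict with the keys contig, model, start, end and score.
    """
    kept = []
    # Visit hits from highest to lowest score so the better hit always wins.
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        clashes = any(
            k["contig"] == hit["contig"]
            and k["model"] == hit["model"]
            and k["start"] <= hit["end"]
            and hit["start"] <= k["end"]
            for k in kept
        )
        if not clashes:
            kept.append(hit)
    return sorted(kept, key=lambda h: (h["contig"], h["start"]))


hits = [
    {"contig": "c1", "model": "RF00005", "start": 10, "end": 80, "score": 50.0},
    {"contig": "c1", "model": "RF00005", "start": 60, "end": 130, "score": 42.0},
    {"contig": "c1", "model": "RF00001", "start": 70, "end": 180, "score": 30.0},
]
for hit in drop_lower_scoring_same_model_overlaps(hits):
    print(hit["model"], hit["start"], hit["end"], hit["score"])
```

The hit to a different model survives even though it overlaps, because this sketch only compares hits to the same model.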

[5.1.2] - 2022-02-10

Changed

Making sure GFF import RNAs also get a product name assignment.

[5.1.1] - 2022-01-04

Changed

Removed import of external python_toolbox and added tprint function to imgap_utils instead.

[5.1.0] - 2021-12-02

Changed

Completely rewrote CDS prediction and overlap resolution part of the structural annotation phase.
The filtering script for the LAST results now looks at the top n bitscores (instead of top n hits).
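
The difference between the two selection rules shows up when several hits share a bitscore. The sketch below keeps every hit whose bitscore is among the n highest distinct values, which is one reading of "top n bitscores" and therefore an assumption. The function name and the (subject_id, bitscore) tuples are hypothetical.

```python
def hits_within_top_n_bitscores(hits, n=5):
    """Keep hits whose bitscore is among the n highest distinct bitscores.

    hits is a list of (subject_id, bitscore) tuples.
    """
    top_scores = sorted({score for _, score in hits}, reverse=True)[:n]
    if not top_scores:
        return []
    cutoff = top_scores[-1]
    return [hit for hit in hits if hit[1] >= cutoff]


hits = [("a", 100), ("b", 100), ("c", 90), ("d", 90), ("e", 80)]
# Top 2 hits would be a and b only; top 2 bitscores also keeps c and d.
print(hits_within_top_n_bitscores(hits, n=2))
```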

We will now combine the annotation database versions and the IMG Annotation Pipeline code version into a single designation called the IMG Annotation Version. Moving forward, there will be at most one major version update per year. These updates may include changes to the major or minor version number of the IMG Annotation Pipeline code (following semantic versioning) and/or updates to newer versions of the annotation databases.

Pipeline v5.0.x (03/05/2019)

Starting from Pipeline v5.0.0, isolate genomes and metagenomes are processed using the same code base. Moreover, additional functions (Cath-Funfam, SuperFamily and SMART) have been added.

Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8269246/

Highlight of IMG Annotation Pipeline v5.0.0: https://img.jgi.doe.gov/docs/pipelineV5/

Reference Database Version Information

Cath-Funfam v4.1.0/v4.2.0
COG v2014
IMG-NR v20190607
KEGG v77.1
Pfam v30
SMART v01_06_2016
SuperFamily v1.75
TIGRFAMs v15.0

IMG Annotation Pipeline Version and Change Log Information

IMGAP v5.0.x

Change Log:

[5.0.25] - 2021-09-10

Changed

If Prodigal and GeneMark predict the same gene, but it gets shortened in two different places, only the shortened GeneMark gene is kept.

[5.0.24] - 2021-08-20

Changed

The filtering script for the LAST results now makes an additional run over the output to check which subjects are actually needed in the MD5 lookup.

[5.0.23] - 2021-02-02

Added

The GFF stats script also creates an additional output file in JSON format now.

[5.0.22] - 2020-01-21

Fixed

Fixed bug that caused the Pfam e-value in the *_pfam.gff to always get set to 13.

Added

The cmsearch command for the Rfam step can now also contain a -Z argument that gets read from the annotation config.

[5.0.21] - 2020-12-11

Changed

Pulling the /dev/null redirect out of the hmmsearch command variables (since it causes JAWS to fail).

[5.0.20] - 2020-10-22

Changed

Set the default max overlap ratio back to 0.1 for cath-funfam, cog, smart and superfamily.
Making sure that new logs don't get deleted when removing files from a potentially failed previous run.

[5.0.19] - 2020-07-28

Added

Added print out of non-IUPAC characters if any get detected in the pre-QC step.

[5.0.18] - 2020-07-17

Added

Added number of genes per 1M bp check to the post-QC step.
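
The density behind this check is simple arithmetic. The change log does not state the accepted range, so the sketch below only computes the value; the function name genes_per_megabase is hypothetical.

```python
def genes_per_megabase(gene_count, total_bp):
    """Return the number of predicted genes per 1,000,000 bp of sequence."""
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return gene_count * 1_000_000 / total_bp


# 4,500 genes on a 5 Mbp assembly.
print(genes_per_megabase(4500, 5_000_000))
```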

[5.0.17] - 2020-07-09

Changed

Increased the default minimum contig length from 150 to 200 bp.

[5.0.16] - 2020-07-06

Added

Added check to make sure that the contig sequences only contain IUPAC letters before replacing all non-ACGTN characters with Ns.
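
The order of operations described here, validating the alphabet first and only then masking, can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch: U is accepted as an IUPAC code, and sequences are upper-cased before checking.

```python
import re

# IUPAC nucleotide codes; including U here is an assumption of this sketch.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def find_non_iupac(sequence):
    """Return the sorted characters in sequence that are not IUPAC codes."""
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)


def clean_contig(sequence):
    """Validate the alphabet, then replace every non-ACGTN character with N."""
    offending = find_non_iupac(sequence)
    if offending:
        raise ValueError(f"non-IUPAC characters found: {''.join(offending)}")
    return re.sub(r"[^ACGTN]", "N", sequence.upper())


print(clean_contig("ACGTRYKN"))
```

Without the upfront check, a character such as X would silently become N, hiding what may be a corrupted or non-nucleotide input.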

[5.0.15] - 2020-04-03

Changed

The poly-N stretch length indicating a gap of unknown length can now be set via the config file. A 0 turns this feature off.

[5.0.14] - 2020-03-17

Added

Commands to remove tmp and results files at the beginning of every step that uses GNU parallel (in case there was a previous run that got killed or failed due to non-pipeline related reasons).

[5.0.13] - 2020-03-06

Changed

Aborting the pipeline when poly-N stretches of length 100 are encountered can now be turned on and off via the config file.

[5.0.12] - 2020-02-28

Added

The pre-QC step now removes leading/trailing Ns from the contigs' ends and checks for N-stretches of exactly 100 bp, which stand for gaps of unknown length. If such a gap exists the pipeline aborts processing and produces a GAPS_OF_UNKNOWN_LENGTH.txt file that lists the contig names and start positions of these gaps of unknown length.
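
A minimal sketch of the two operations in this entry: trimming terminal Ns and locating N runs of exactly 100 bp. The function names are hypothetical, and reporting 1-based start positions is an assumption; the change log does not specify the coordinate convention used in GAPS_OF_UNKNOWN_LENGTH.txt.

```python
import re


def trim_terminal_ns(sequence):
    """Remove leading and trailing Ns from a contig sequence."""
    return sequence.strip("Nn")


def find_unknown_length_gaps(sequence, gap_length=100):
    """Return 1-based start positions of N runs that are exactly gap_length long."""
    return [
        match.start() + 1
        for match in re.finditer(r"[Nn]+", sequence)
        if len(match.group()) == gap_length
    ]


contig = "NN" + "ACGT" + "N" * 100 + "ACGT" + "N" * 101 + "AC" + "NNN"
trimmed = trim_terminal_ns(contig)
print(find_unknown_length_gaps(trimmed))
```

Matching whole N runs and then comparing their length avoids reporting a 100 bp window inside a longer run, such as the 101 bp stretch in the example.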

[5.0.11] - 2020-01-27

Changed

The pre-QC step now creates a PRE_QC_FAILURE file in the input file directory if any of the QC rules was violated. The file lists the detailed reason.

[5.0.10] - 2020-01-15

Changed

For metagenomes the tRNA prediction now also uses B and A models instead of the general one.

[5.0.9] - 2019-11-28

Changed

Each hmmsearch module now executes multiple instances of hmmsearch in parallel. The number of parallel hmmsearch instances is now an entry in the yaml config file.
The -Z argument was added to the hmmsearch command lines. The value for the -Z argument is now also an entry in the config file.

[5.0.8] - 2019-11-27

Changed

The minimum contig length used by the pre-QC scripts is now set via the annotation config yaml.

[5.0.7] - 2019-11-11

Added

Additional Warning line handling in tRNAscan parser.

[5.0.6] - 2019-09-25

Removed

The locus_tag attribute from all GFF files.

[5.0.5] - 2019-06-22

Added

The pre-QC and GFF and Fasta stats steps can now get turned on and off via the annotation config file.

[5.0.4] - 2019-06-08

Changed

Prodigal: If an isolate has fewer than 20,000 bp, it is processed in meta mode.

Fixed

Fixed some typos.

[5.0.3] - 2019-06-02

Added

Tracking of which module got started last (for in-house purposes, to catch datasets that time out on a specific module).

Fixed

Fixed bug that in some cases caused duplicate genes, when one gene got shortened.

[5.0.2] - 2019-05-24

Fixed

Fixed bug that in some cases caused SignalP or TMHMM results not to be present at the product name assignment step.

[5.0.1] - 2019-04-04

Added

Added checkpoint information for every step of the pipeline, so that previously finished modules don't need to get executed again, when a job gets resubmitted.

[5.0.0] - 2019-03-05

Release

New IMG Annotation Pipeline version that unifies isolate and metagenome processing.

Pipeline v4.x

Previously IMG used slightly different pipelines to process isolate genomes and metagenomes.

Publications:
