Annotation Pipeline Change Log

Pipeline v.5.1 (10/03/2022)

Reference Database and Version Information

COG version from 2003 with COG names and COG assignment to functional categories per 2014 update
Pfam V.34 (March 2021)
TIGRFam v.15.0
Cath-Funfam v.4.1.0
SuperFamily v.1.75
SMART 01_06_2016
KEGG V.98.0 (April 2021)
IMG NR 20211118

Pipeline Code Changes

Slight changes were made in each subversion to fix bugs, add additional error checking, or make minor improvements.

[5.1.11] - 2022-10-03

Fixed

Fixed a bug that caused the pipeline to keep running despite all sequences being filtered out during pre-QC.

[5.1.10] - 2022-09-28

Fixed

Fixed a bug that caused duplicate start_type attributes in the structural/functional annotation GFFs.

[5.1.9] - 2022-08-23

Fixed

Fixed a bug that in specific cases could cause an IndexError in the Contig.create_final_trusted_genes method of the cds_overlap_resolver.py script.

[5.1.8] - 2022-06-16

Changed

The pre-QC step now replaces carriage returns with newline characters instead of removing them. It then collapses each run of consecutive newlines into a single newline.
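
The two-step normalization described above could look roughly like the following. This is a minimal sketch, not the pipeline's actual code; the function name `normalize_newlines` is hypothetical.

```python
import re


def normalize_newlines(text: str) -> str:
    """Replace carriage returns with newlines, then collapse newline runs.

    Hypothetical helper illustrating the 5.1.8 change; not pipeline code.
    """
    # Step 1: every carriage return becomes a newline ("\r\n" turns into "\n\n").
    text = text.replace("\r", "\n")
    # Step 2: each run of consecutive newlines is reduced to a single newline.
    return re.sub(r"\n+", "\n", text)
```

Replacing rather than deleting carriage returns keeps files with bare `\r` line endings from collapsing into a single line, and the second step removes the empty lines that the replacement would otherwise introduce for `\r\n` endings.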

[5.1.7] - 2022-06-01

Added

The config file now contains a translation_table entry that can be used to force a specific genetic code for CDS prediction. A value of 0 keeps the previous behavior, in which the pipeline tries to auto-predict the correct translation table.
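
One way to picture the semantics of this entry is the sketch below. Only the `translation_table` key and the meaning of 0 come from the change log; the function name, the `auto_predict` callback, and the choice to treat a missing key as 0 are assumptions made for illustration.

```python
def resolve_translation_table(config: dict, auto_predict) -> int:
    """Pick the genetic code to use for CDS prediction.

    Hypothetical helper: `auto_predict` stands in for whatever the pipeline
    does to guess the table; a missing key is treated as 0 (an assumption).
    """
    table = int(config.get("translation_table", 0))
    if table == 0:
        # 0 means "no forced table": fall back to auto-prediction.
        return auto_predict()
    # Any other value forces that genetic code.
    return table
```

For example, `resolve_translation_table({"translation_table": 4}, lambda: 11)` returns the forced table 4, while a value of 0 defers to the callback.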

[5.1.6] - 2022-05-26

Changed

Removed the files created by GeneMark's translation table runs from the set of permanent output files.

[5.1.5] - 2022-03-12

Changed

FASTA files that do not start with a '>' character now fail the pre-QC step.
The pre-QC step also fails if an empty line in the FASTA file is followed by a line that does not start with '>'.
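
The two layout rules above can be sketched as a small check. The function name and the failure messages are hypothetical, and treating an empty file as a failure of the first rule is an assumption.

```python
def check_fasta_layout(lines):
    """Return a failure reason, or None if both layout rules hold.

    Hypothetical sketch of the 5.1.5 rules; not the pipeline's own code.
    """
    lines = [line.rstrip("\r\n") for line in lines]
    # Rule 1: the file has to start with a header line.
    if not lines or not lines[0].startswith(">"):
        return "file does not start with '>'"
    # Rule 2: an empty line may only be followed by a header line.
    for previous, current in zip(lines, lines[1:]):
        if previous == "" and not current.startswith(">"):
            return "empty line followed by a non-header line"
    return None
```

An empty line at the very end of the file passes this check, since no line follows it.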

[5.1.4] - 2022-02-25

Changed

Overlapping CRISPR and CDS features that are otherwise allowed can no longer start at the same position.

[5.1.3] - 2022-02-12

Changed

The Rfam clan filter script now also removes the lower scoring of two overlapping hits to the same model, even if they are not in a clan.
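
A greedy filter is one simple way to express this rule. The sketch below is an illustration only: the `Hit` record, the function name, and the decision to ignore strand are assumptions, and the actual script may resolve overlaps differently.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one Rfam hit (coordinates 1-based, inclusive)."""
    contig: str
    model: str
    start: int
    end: int
    score: float


def drop_lower_scoring_same_model_overlaps(hits):
    """Keep the best-scoring hit among overlapping hits to the same model."""
    kept = []
    # Visit hits from best to worst so a kept hit always outscores later ones.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        clashes = any(
            k.contig == hit.contig
            and k.model == hit.model
            and hit.start <= k.end
            and k.start <= hit.end
            for k in kept
        )
        if not clashes:
            kept.append(hit)
    return kept
```

Hits to different models are left alone here even when they overlap, because that case is what the clan-based part of the filter handles.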

[5.1.2] - 2022-02-10

Changed

RNAs imported from GFF files now also get a product name assignment.

[5.1.1] - 2022-01-04

Changed

Removed the import of the external python_toolbox and added a tprint function to imgap_utils instead.

[5.1.0] - 2021-12-02

Changed

Completely rewrote CDS prediction and overlap resolution part of the structural annotation phase.
The filtering script for the LAST results now looks at the top n bitscores (instead of top n hits).
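
The difference between "top n hits" and "top n bitscores" shows up when several hits tie on bitscore. The sketch below assumes that "top n bitscores" means the n highest distinct bitscore values per query; the function name and the `(subject_id, bitscore)` tuple layout are hypothetical.

```python
def keep_top_n_bitscores(hits, n):
    """Keep every hit whose bitscore is among the n highest distinct values.

    `hits` is a list of (subject_id, bitscore) tuples for one query.
    Hypothetical sketch; the real filter script may differ in detail.
    """
    top_scores = set(sorted({score for _, score in hits}, reverse=True)[:n])
    # Preserve the input order of the surviving hits.
    return [hit for hit in hits if hit[1] in top_scores]
```

With hits scoring 200, 200, 180 and 150 and n = 2, a top-n-hits filter would keep only the two tied hits at 200, whereas this version also keeps the hit at 180.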

Pipeline v.5.0

Starting from Pipeline v.5.0, isolate genomes and metagenomes are processed using the same code base. Moreover, additional functions (Cath-Funfam, SuperFamily and SMART) have been added.

Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8269246/

Highlight of IMG Annotation Pipeline v.5.0: https://img.jgi.doe.gov/docs/pipelineV5/

Reference Database and Version Information

COG version from 2003 with COG names and COG assignment to functional categories per 2014 update
Pfam V.30 (February 2017)
TIGRFam v.15.0
Cath-Funfam v.4.1.0
SuperFamily v.1.75
SMART 01_06_2016
KEGG V.77.1 (February 2016)
IMG NR 20190607

Pipeline Code Changes

Slight changes were made in each subversion to fix bugs, add additional error checking, or make minor improvements.

## [5.0.25] - 2021-09-10

### Changed

- If Prodigal and GeneMark predict the same gene, but it gets shortened in two different places, only the shortened GeneMark gene is kept.

## [5.0.24] - 2021-08-20

### Changed

- The filtering script for the LAST results now makes an additional run over the output to check which subjects are actually needed in the MD5 lookup.

## [5.0.23] - 2021-02-02

### Added

- The GFF stats script also creates an additional output file in JSON format now.

## [5.0.22] - 2020-01-21

### Fixed

- Fixed bug that caused the Pfam e-value in the *_pfam.gff to always get set to 13.

### Added

- The cmsearch command for the Rfam step can now also contain a -Z argument that gets read from the annotation config.

## [5.0.21] - 2020-12-11

### Changed

- Pulled the /dev/null redirect out of the hmmsearch command variables, since it caused JAWS to fail.

## [5.0.20] - 2020-10-22

### Changed

- Set the default max overlap ratio back to 0.1 for Cath-Funfam, COG, SMART and SuperFamily.

- New logs no longer get deleted when files from a potentially failed previous run are removed.

## [5.0.19] - 2020-07-28

### Added

- Added print out of non-IUPAC characters if any get detected in the pre-QC step.

## [5.0.18] - 2020-07-17

### Added

- Added number of genes per 1M bp check to the post-QC step.
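
The quantity behind this check is a simple density. The change log does not state which thresholds the post-QC step applies, so the sketch below only computes the value; the function name is hypothetical.

```python
def genes_per_mbp(gene_count: int, total_bp: int) -> float:
    """Number of predicted genes per 1,000,000 bp of sequence.

    Hypothetical helper; the post-QC thresholds are not given in the log.
    """
    if total_bp <= 0:
        raise ValueError("total_bp must be positive")
    return gene_count * 1_000_000 / total_bp
```

For example, 4500 genes on 5,000,000 bp gives `genes_per_mbp(4500, 5_000_000) == 900.0`.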

## [5.0.17] - 2020-07-09

### Changed

- Increased the default minimum contig length from 150 to 200 bp.

## [5.0.16] - 2020-07-06

### Added

- Added check to make sure that the contig sequences only contain IUPAC letters before replacing all non-ACGTN characters with Ns.
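
The check-then-mask order described above can be sketched as follows. The function names are hypothetical, and the exact alphabet the pipeline accepts is an assumption: this sketch uses the standard IUPAC nucleotide letters (including U) without gap symbols and ignores case.

```python
import re

# Standard IUPAC nucleotide letters (assumed alphabet; no gap characters).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def find_non_iupac(seq: str):
    """Return the sorted list of characters that are not IUPAC letters."""
    return sorted(set(seq.upper()) - IUPAC_NUCLEOTIDES)


def mask_ambiguous(seq: str) -> str:
    """Replace every character other than A, C, G, T, N with N."""
    return re.sub(r"[^ACGTN]", "N", seq.upper())
```

Running `find_non_iupac` first matters because the masking step would otherwise silently turn invalid characters into Ns and hide a malformed input.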

## [5.0.15] - 2020-04-03

### Changed

- The poly-N stretch length indicating a gap of unknown length can now be set via the config file. A 0 turns this feature off.

## [5.0.14] - 2020-03-17

### Added

- Commands to remove tmp and results files at the beginning of every step that uses GNU parallel (in case there was a previous run that got killed or failed due to non-pipeline related reasons).

## [5.0.13] - 2020-03-06

### Changed

- Aborting pipeline processing when poly-N stretches of length 100 are encountered can now be turned on and off via the config file.

## [5.0.12] - 2020-02-28

### Added

- The pre-QC step now removes leading/trailing Ns from the contigs' ends and checks for N-stretches of exactly 100 bp, which stand for gaps of unknown length. If such a gap exists the pipeline aborts processing and produces a GAPS_OF_UNKNOWN_LENGTH.txt file that lists the contig names and start positions of these gaps of unknown length.
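
The trimming and the exact-length gap search can be sketched with a regular expression. The function names are hypothetical, and reporting 1-based start positions and ignoring case are assumptions. The `gap_len` parameter reflects the later change (5.0.15) that made the length configurable, with 0 turning the check off.

```python
import re


def trim_terminal_ns(seq: str) -> str:
    """Remove leading and trailing Ns from a contig sequence."""
    return seq.strip("Nn")


def find_unknown_length_gaps(seq: str, gap_len: int = 100):
    """Return 1-based start positions of N-runs of exactly `gap_len` bp.

    Hypothetical sketch; a gap_len of 0 (or less) disables the check.
    """
    if gap_len <= 0:
        return []
    # The lookarounds reject runs that are longer than gap_len.
    pattern = re.compile(r"(?<![Nn])[Nn]{%d}(?![Nn])" % gap_len)
    return [match.start() + 1 for match in pattern.finditer(seq)]
```

A run of 101 Ns is not reported by this sketch, because only a run of exactly the configured length is taken as the marker for a gap of unknown length.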

## [5.0.11] - 2020-01-27

### Changed

- The pre-QC step now creates a PRE_QC_FAILURE file in the input file directory if any of the QC rules were violated. The file lists the detailed reason.

## [5.0.10] - 2020-01-15

### Changed

- For metagenomes, the tRNA prediction now also uses the bacterial (B) and archaeal (A) models instead of the general one.

## [5.0.9] - 2019-11-28

### Changed

- Each hmmsearch module now executes multiple instances of hmmsearch in parallel. The number of parallel hmmsearch instances is now an entry in the yaml config file.

- The -Z argument was added to the hmmsearch command lines. The value for the -Z argument is now also an entry in the config file.

## [5.0.8] - 2019-11-27

### Changed

- The minimum contig length used by the pre-QC scripts is now set via the annotation config yaml.

## [5.0.7] - 2019-11-11

### Added

- Additional Warning line handling in tRNAscan parser.

## [5.0.6] - 2019-09-25

### Removed

- The locus_tag attribute from all GFF files.

## [5.0.5] - 2019-06-22

### Added

- The pre-QC and GFF and Fasta stats steps can now get turned on and off via the annotation config file.

## [5.0.4] - 2019-06-08

### Changed

- Prodigal: If an isolate has fewer than 20000 bp, it gets processed in meta mode.
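
The mode switch can be sketched as below. Prodigal's `-p` option accepts `single` and `meta`; the function names, the constant name, and the file paths are hypothetical, and the pipeline's real command line likely carries more arguments.

```python
# Threshold taken from the change log entry above.
MIN_BP_FOR_SINGLE_MODE = 20000


def prodigal_mode_for_isolate(total_bp: int) -> str:
    """Choose Prodigal's procedure for an isolate of the given total size."""
    # Small isolates get processed in meta mode.
    return "meta" if total_bp < MIN_BP_FOR_SINGLE_MODE else "single"


def build_prodigal_command(fasta_path: str, gff_path: str, total_bp: int):
    """Assemble an illustrative Prodigal call (paths are placeholders)."""
    return [
        "prodigal",
        "-i", fasta_path,
        "-o", gff_path,
        "-f", "gff",
        "-p", prodigal_mode_for_isolate(total_bp),
    ]
```

The returned list can be passed to `subprocess.run` without shell quoting concerns.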

### Fixed

- Fixed some typos.

## [5.0.3] - 2019-06-02

### Added

- Tracking of which module got started last (for in-house purposes, to catch datasets that time out on a specific module).

### Fixed

- Fixed bug that in some cases caused duplicate genes, when one gene got shortened.

## [5.0.2] - 2019-05-24

### Fixed

- Fixed bug that in some cases caused SignalP or TMHMM results not to be present at the product name assignment step.

## [5.0.1] - 2019-04-04

### Added

- Added checkpointing information for every step of the pipeline, so that previously finished modules don't need to get executed again, when a job gets resubmitted.

## [5.0.0] - 2019-03-05

### Release

- New IMG Annotation Pipeline version that unifies isolate and metagenome processing.

Pipeline v.4

Previously IMG used slightly different pipelines to process isolate genomes and metagenomes.

Publications: