Annotation Pipeline Change Log
Pipeline v.5.1 (10/03/2022)
Reference Database and Version Information
COG version from 2003 with COG names and COG assignment to functional categories per 2014 update
Pfam V.34 (March 2021)
TIGRFam v.15.0
Cath-Funfam v.4.1.0
SuperFamily v.1.75
SMART 01_06_2016
KEGG V.98.0 (April 2021)
IMG NR 20211118
Pipeline Code Changes
Slight changes were made in each subversion to fix bugs, add additional error checking, or make minor improvements.
[5.1.11] - 2022-10-03
Fixed
Fixed a bug that caused the pipeline to keep running despite all sequences being filtered out during pre-QC.
[5.1.10] - 2022-09-28
Fixed
Fixed a bug that caused duplicate start_type attributes in the structural/functional annotation GFFs.
[5.1.9] - 2022-08-23
Fixed
Fixed a bug that in specific cases could cause an IndexError in the Contig.create_final_trusted_genes method of the cds_overlap_resolver.py script.
[5.1.8] - 2022-06-16
Changed
The pre-QC step now replaces carriage returns with newline characters instead of removing them, then collapses consecutive newlines into a single newline.
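A minimal sketch of the two-step newline handling described in 5.1.8. The function name `normalize_newlines` is hypothetical and not taken from the pipeline code; only the behavior (carriage return to newline, then collapse repeated newlines) comes from the entry above.

```python
import re


def normalize_newlines(text: str) -> str:
    """Replace carriage returns with newlines, then collapse newline runs."""
    # Step 1: every carriage return becomes a newline, so "\r\n" turns into "\n\n".
    text = text.replace("\r", "\n")
    # Step 2: any run of consecutive newlines is reduced to a single newline.
    return re.sub(r"\n+", "\n", text)
```

Replacing rather than deleting carriage returns matters for files that use a bare `\r` as the line terminator: deleting it would merge a header line with the sequence that follows.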
[5.1.7] - 2022-06-01
Added
The config file now contains a translation_table entry that can be used to force a specific genetic code for CDS prediction. A value of 0 keeps the previous behavior, in which the pipeline tries to auto-predict the correct translation table.
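The selection logic for translation_table might look roughly like the sketch below. Only the entry name and the meaning of 0 come from the change log; the function name, the dict-based config, the `auto_predict` callable, and the treatment of a missing entry are assumptions made for illustration.

```python
def resolve_translation_table(config, auto_predict):
    """Return the genetic code to use for CDS prediction.

    `config` is the parsed annotation config (a plain dict in this sketch).
    `auto_predict` is a hypothetical callable standing in for the
    pipeline's own translation table auto-prediction.
    """
    # Assumption: a missing entry is treated like 0 (auto-predict).
    table = int(config.get("translation_table", 0))
    if table == 0:
        # 0 means "do what the pipeline did before": predict the table.
        return auto_predict()
    # Any other value forces that genetic code.
    return table
```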
[5.1.6] - 2022-05-26
Changed
Files created by GeneMark's translation table runs are no longer part of the permanent output file set.
[5.1.5] - 2022-03-12
Changed
Fasta files not starting with a '>' character fail the pre-QC step.
If an empty line in the fasta file is followed by a line not starting with '>', the pre-QC step fails.
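The two 5.1.5 rules can be sketched as a small check over the lines of a fasta file. The function name and the returned messages are hypothetical; the sketch also reads the second rule literally, so two empty lines in a row count as a failure.

```python
def fasta_pre_qc_failure(lines):
    """Return a failure reason, or None if both 5.1.5 checks pass.

    `lines` is the file content split into lines without line terminators.
    """
    # Rule 1: the file has to start with a '>' character.
    if not lines or not lines[0].startswith(">"):
        return "file does not start with '>'"
    # Rule 2: an empty line may only be followed by a header line.
    for previous, current in zip(lines, lines[1:]):
        if previous.strip() == "" and not current.startswith(">"):
            return "empty line followed by a line not starting with '>'"
    return None
```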
[5.1.4] - 2022-02-25
Changed
Allowed overlapping CRISPR and CDS features can no longer start at the same position.
[5.1.3] - 2022-02-12
Changed
The Rfam clan filter script now also removes the lower scoring of two overlapping hits to the same model, even if they are not in a clan.
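One way to picture the same-model filter from 5.1.3 is a greedy pass over hits sorted by score. This is a sketch, not the pipeline's script: the `Hit` record, the function name, and the overlap definition (any shared position, ignoring contig and strand) are assumptions.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    model: str    # Rfam model accession, e.g. "RF00001"
    start: int    # 1-based inclusive start
    end: int      # 1-based inclusive end
    score: float  # bit score reported for the hit


def drop_lower_scoring_overlaps(hits):
    """Keep only the best-scoring hit among overlapping hits to one model."""
    kept = []
    # Visit hits from best to worst so a kept hit always outranks later ones.
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        clashes = any(
            k.model == hit.model and hit.start <= k.end and k.start <= hit.end
            for k in kept
        )
        if not clashes:
            kept.append(hit)
    # Return the survivors in coordinate order.
    return sorted(kept, key=lambda h: h.start)
```

Hits to different models are left alone here, since the clan-based part of the filter is outside this sketch.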
[5.1.2] - 2022-02-10
Changed
RNAs imported from GFF now also get a product name assignment.
[5.1.1] - 2022-01-04
Changed
Removed import of external python_toolbox and added tprint function to imgap_utils instead.
[5.1.0] - 2021-12-02
Changed
Completely rewrote the CDS prediction and overlap resolution part of the structural annotation phase.
The filtering script for the LAST results now looks at the top n bitscores (instead of top n hits).
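The difference between "top n hits" and "top n bitscores" shows up when several hits share a bitscore. The sketch below assumes that "top n bitscores" means the n highest distinct bitscore values, with every hit at one of those values kept; the change log does not spell this out, and the function name and tuple layout are hypothetical.

```python
def top_n_bitscores(hits, n):
    """Keep every hit whose bitscore is among the n highest distinct values.

    `hits` is a list of (subject_id, bitscore) tuples for one query.
    """
    # Collect the n best distinct bitscore values.
    best_scores = set(sorted({score for _, score in hits}, reverse=True)[:n])
    # Keep all hits at those values, preserving the input order.
    return [(subject, score) for subject, score in hits if score in best_scores]
```

With hits scoring 200, 200, 150 and 90 and n = 2, a top-n-hits filter would keep only the two hits at 200, while this sketch also keeps the hit at 150.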
Pipeline v.5.0
Starting from Pipeline v.5.0, isolate genomes and metagenomes are processed using the same code base. Moreover, additional functions (Cath-Funfam, SuperFamily and SMART) have been added.
Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8269246/
Highlight of IMG Annotation Pipeline v.5.0: https://img.jgi.doe.gov/docs/pipelineV5/
Reference Database and Version Information
COG version from 2003 with COG names and COG assignment to functional categories per 2014 update
Pfam V.30 (February 2017)
TIGRFam v.15.0
Cath-Funfam v.4.1.0
SuperFamily v.1.75
SMART 01_06_2016
KEGG V.77.1 (February 2016)
IMG NR 20190607
Pipeline Code Changes
Slight changes were made in each subversion to fix bugs, add additional error checking, or make minor improvements.
## [5.0.25] - 2021-09-10
### Changed
- If Prodigal and GeneMark predict the same gene, but it gets shortened in two different places, only the shortened GeneMark gene is kept.
## [5.0.24] - 2021-08-20
### Changed
- The filtering script for the LAST results now makes an additional run over the output to check which subjects are actually needed in the MD5 lookup.
## [5.0.23] - 2021-02-02
### Added
- The GFF stats script also creates an additional output file in JSON format now.
## [5.0.22] - 2021-01-21
### Fixed
- Fixed bug that caused the Pfam e-value in the *_pfam.gff to always get set to 13.
### Added
- The cmsearch command for the Rfam step can now also contain a -Z argument that gets read from the annotation config.
## [5.0.21] - 2020-12-11
### Changed
- Moved the /dev/null redirect out of the hmmsearch command variables (since it causes JAWS to fail).
## [5.0.20] - 2020-10-22
### Changed
- Set the default max overlap ratio back to 0.1 for cath-funfam, cog, smart and superfamily.
- Making sure that new logs don't get deleted when removing files from a potentially failed previous run.
## [5.0.19] - 2020-07-28
### Added
- Added print out of non-IUPAC characters if any get detected in the pre-QC step.
## [5.0.18] - 2020-07-17
### Added
- Added number of genes per 1M bp check to the post-QC step.
## [5.0.17] - 2020-07-09
### Changed
- Increased the default minimum contig length from 150 to 200 bp.
## [5.0.16] - 2020-07-06
### Added
- Added check to make sure that the contig sequences only contain IUPAC letters before replacing all non-ACGTN characters with Ns.
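The order of operations in 5.0.16 (validate first, then mask) can be sketched as follows. The allowed alphabet below is the standard IUPAC nucleotide code set; whether the pipeline accepts exactly these letters, and whether it uppercases sequences, are assumptions, as are the function names.

```python
import re

# Standard IUPAC nucleotide codes (assumed to match the pipeline's check).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def non_iupac_characters(seq):
    """Return the sorted list of characters that are not IUPAC codes."""
    return sorted(set(seq.upper()) - IUPAC_NUCLEOTIDES)


def mask_ambiguous(seq):
    """Validate the alphabet, then replace every non-ACGTN letter with N."""
    bad = non_iupac_characters(seq)
    if bad:
        # Checking first keeps junk characters from being silently masked.
        raise ValueError("non-IUPAC characters found: " + "".join(bad))
    return re.sub(r"[^ACGTN]", "N", seq.upper())
```

Running the check before the replacement is what makes it useful: afterwards, a stray character and a legitimate ambiguity code would both look like N.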
## [5.0.15] - 2020-04-03
### Changed
- The poly-N stretch length indicating a gap of unknown length can now be set via the config file. A 0 turns this feature off.
## [5.0.14] - 2020-03-17
### Added
- Commands to remove tmp and results files at the beginning of every step that uses GNU parallel (in case there was a previous run that got killed or failed due to non-pipeline related reasons).
## [5.0.13] - 2020-03-06
### Changed
- Aborting pipeline processing when poly-N stretches of length 100 are encountered can now be turned on and off via the config file.
## [5.0.12] - 2020-02-28
### Added
- The pre-QC step now removes leading/trailing Ns from the contigs' ends and checks for N-stretches of exactly 100 bp, which stand for gaps of unknown length. If such a gap exists the pipeline aborts processing and produces a GAPS_OF_UNKNOWN_LENGTH.txt file that lists the contig names and start positions of these gaps of unknown length.
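The trimming and gap detection from 5.0.12 can be sketched with two small helpers. The function names, the 1-based start positions, and the case-insensitive matching are assumptions; the change log only states that terminal Ns are removed and that N-stretches of exactly 100 bp are reported.

```python
import re


def trim_terminal_ns(seq):
    """Remove leading and trailing Ns from a contig sequence."""
    return seq.strip("Nn")


def gaps_of_unknown_length(seq, gap_length=100):
    """Return 1-based start positions of N runs that are exactly gap_length long."""
    return [
        match.start() + 1
        for match in re.finditer(r"N+", seq.upper())
        # Only maximal runs of exactly gap_length count; longer runs do not.
        if len(match.group()) == gap_length
    ]
```

Matching maximal runs with `N+` and then comparing lengths avoids reporting a 100 bp window inside a longer N-stretch. A `gap_length` of 0 never matches, so it returns an empty list.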
## [5.0.11] - 2020-01-27
### Changed
- The pre-QC step now creates a PRE_QC_FAILURE file in the input file directory if any of the QC rules were violated. The file lists the detailed reason.
## [5.0.10] - 2020-01-15
### Changed
- For metagenomes the tRNA prediction now also uses B and A models instead of the general one.
## [5.0.9] - 2019-11-28
### Changed
- Each hmmsearch module now executes multiple instances of hmmsearch in parallel. The number of parallel hmmsearch instances is now an entry in the yaml config file.
- The -Z argument was added to the hmmsearch command lines. The value for the -Z argument is now also an entry in the config file.
## [5.0.8] - 2019-11-27
### Changed
- The minimum contig length used by the pre-QC scripts is now set via the annotation config yaml.
## [5.0.7] - 2019-11-11
### Added
- Additional Warning line handling in tRNAscan parser.
## [5.0.6] - 2019-09-25
### Removed
- The locus_tag attribute from all GFF files.
## [5.0.5] - 2019-06-22
### Added
- The pre-QC and GFF and Fasta stats steps can now get turned on and off via the annotation config file.
## [5.0.4] - 2019-06-08
### Changed
- Prodigal: If an isolate has less than 20000 bp it will get processed in meta mode.
### Fixed
- Fixed some typos.
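The Prodigal rule from 5.0.4 amounts to a size threshold on isolate genomes. In the sketch below, `single` and `meta` are the two values accepted by Prodigal's `-p` option; the function name and the parameter name are hypothetical, and how the pipeline actually builds its Prodigal command line is not shown in this change log.

```python
def prodigal_mode_for_isolate(total_bp, meta_threshold_bp=20000):
    """Pick the value for Prodigal's -p option for an isolate genome.

    Isolates with fewer than `meta_threshold_bp` bases are run in meta mode,
    presumably because they are too short to train a gene model on.
    """
    return "meta" if total_bp < meta_threshold_bp else "single"
```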
## [5.0.3] - 2019-06-02
### Added
- Tracking of which module got started last (for in-house purposes, to catch datasets that time out on a specific module).
### Fixed
- Fixed bug that in some cases caused duplicate genes, when one gene got shortened.
## [5.0.2] - 2019-05-24
### Fixed
- Fixed bug that in some cases caused SignalP or TMHMM results not to be present at the product name assignment step.
## [5.0.1] - 2019-04-04
### Added
- Added checkpointing information for every step of the pipeline, so that previously finished modules don't need to get executed again, when a job gets resubmitted.
## [5.0.0] - 2019-03-05
### Release
- New IMG Annotation Pipeline version that unifies isolate and metagenome processing.
Pipeline v.4
Previously IMG used slightly different pipelines to process isolate genomes and metagenomes.
Publications: