This annotation version includes updates to Pfam v37.0, KEGG v111.1, and IMG-NR v20240916, and adds the geNomad database v1.7. The geNomad database is required by the newly added geNomad module in the IMG annotation pipeline, which identifies mobile genetic elements; the new module prompted a minor version upgrade.
All isolate genomes have been re-annotated using the updated Pfam and KEGG versions.
Cath-Funfam v4.2.0
COG v2014
geNomad v1.7
IMG-NR v20240916
KEGG v111.1
Pfam v37.0
SMART v01_06_2016
SuperFamily v1.75
TIGRFAMs v15.0
IMGAP v5.3.x
Change Log:
Added genomad/genomad.sh to run the geNomad tool. It runs the container built from the public IMG geNomad container.
Added new container_run_command option in the config yaml. This will be used to execute containers (currently only used in the new genomad module).
Modified the yaml parser to properly handle values that contain whitespace (required for the new container_run_command option in the config yaml).
The lastal call now specifically sets the -m argument. Refactored the options/arguments of its wrapper script to accommodate this change.
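The whitespace requirement behind the container_run_command entry can be illustrated with a minimal Python sketch. It is not the pipeline's parser: the function names, the quoting rule, and the example value "docker run --rm" are assumptions made for illustration only. The point is that a value such as a container command contains spaces, so the parser has to split on the first colon only and hand the whole remainder to the caller.

```python
import shlex


def parse_flat_config(text):
    """Parse flat 'key: value' lines, keeping whitespace inside values."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so the value may contain
        # spaces and further colons.
        key, separator, value = line.partition(":")
        if not separator:
            continue  # not a key/value line
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            value = value[1:-1]
        config[key.strip()] = value
    return config


def build_container_command(config, tool_args):
    """Prepend the configured container prefix to a tool invocation."""
    prefix = shlex.split(config["container_run_command"])
    return prefix + list(tool_args)


# Hypothetical config fragment; the real value depends on the site setup.
EXAMPLE_CONFIG = """
# annotation config (illustrative)
container_run_command: "docker run --rm"
threads: 8
"""

config = parse_flat_config(EXAMPLE_CONFIG)
# "example_image" and "example_tool" are placeholders, not real names.
command = build_container_command(config, ["example_image", "example_tool"])
print(command)
```

Splitting the stored string with shlex.split keeps quoted arguments together, which matters once the container prefix carries options of its own.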
This annotation version upgrade includes a new IMG-NR and a minor version update to the IMG Annotation Pipeline, driven by the addition of a step for assigning phylogeny to contigs.
Cath-Funfam v4.2.0
COG v2014
IMG-NR v20230629
KEGG v98.0
Pfam v34.0
SMART v01_06_2016
SuperFamily v1.75
TIGRFAMs v15.0
IMGAP v5.2.x
Change Log:
Added TMHMM version to decodeanhmm version.
Each module outputs its tool's version info.
The gff_and_final_fasta_stats.py output now gets routed to its own log file.
Fixed bug that caused the HMMER version to be empty in the TIGRFAM GFF file.
Removed last newline from /usr/bin/time format.
Removed tee usage so that outputs get only routed to their respective log files.
Removed thread throttling from the tRNA module.
The TMHMM model file now gets detected automatically.
TMHMM model entry in annotation config template.
Added create_scaffold_lineage.py which creates a tsv file containing the consensus lineage for each scaffold based on the gene phylogeny tsv file.
Renamed lastal_img_nr_ko_ec_gene_phylo.sh to lastal_img_nr_ko_ec_gene_phylo_scaffold_lineage.sh.
The img_nr_ko_ec_gene_phylo.log got renamed to img_nr_ko_ec_gene_phylo_scaffold_lineage.log.
Changed split and replace characters in lastal_img_nr_ko_ec_gene_phylo_hit_selector.py to accommodate slight format changes made to the new IMG NR MD5 lookup file.
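The idea of deriving one consensus lineage per scaffold from per-gene phylogeny assignments, as create_scaffold_lineage.py does, can be sketched as follows. The change log does not describe the consensus rule, so this sketch assumes a simple majority vote: the lineage is extended rank by rank for as long as more than half of the scaffold's genes agree. The input layout (scaffold ID plus a semicolon-separated lineage string), the function names, and the 0.5 threshold are likewise assumptions, not the script's actual behavior.

```python
from collections import Counter, defaultdict


def consensus_lineage(lineages, min_fraction=0.5):
    """Longest lineage prefix backed by more than min_fraction of the genes."""
    total = len(lineages)
    consensus = ()
    while total:
        depth = len(consensus)
        # Count the next-rank extensions among genes that agree so far.
        counts = Counter(
            lineage[: depth + 1]
            for lineage in lineages
            if len(lineage) > depth and lineage[:depth] == consensus
        )
        if not counts:
            break  # no gene has a deeper rank
        prefix, support = counts.most_common(1)[0]
        if support / total <= min_fraction:
            break  # no majority at this rank
        consensus = prefix
    return consensus


def scaffold_lineages(gene_rows, separator=";"):
    """Map each scaffold to the consensus lineage of its genes."""
    per_scaffold = defaultdict(list)
    for scaffold_id, lineage in gene_rows:
        per_scaffold[scaffold_id].append(tuple(lineage.split(separator)))
    return {
        scaffold_id: separator.join(consensus_lineage(lineages))
        for scaffold_id, lineages in per_scaffold.items()
    }


# Illustrative rows: (scaffold ID, gene lineage). Not real pipeline output.
rows = [
    ("scaffold_1", "Bacteria;Proteobacteria;Gammaproteobacteria"),
    ("scaffold_1", "Bacteria;Proteobacteria;Alphaproteobacteria"),
    ("scaffold_1", "Bacteria;Proteobacteria;Gammaproteobacteria"),
    ("scaffold_1", "Bacteria;Firmicutes;Bacilli"),
    ("scaffold_2", "Archaea;Euryarchaeota"),
]
result = scaffold_lineages(rows)
for scaffold_id, lineage in result.items():
    print(scaffold_id, lineage, sep="\t")
```

In this example, scaffold_1 stops at the phylum level because only two of its four genes agree on a class, which does not exceed the assumed majority threshold.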
This annotation version includes updates to Pfam v34.0, KEGG v98.0, and a new IMG-NR. The IMG annotation pipeline received a minor version upgrade due to a complete rewrite of the CDS prediction module, including a new overlap resolution approach. Additionally, the pipeline now considers the top 5 bitscores instead of the top 5 hits.
All isolate genomes have been re-annotated using the updated Pfam and KEGG versions.
Cath-Funfam v4.2.0
COG v2014
IMG-NR v20211118
KEGG v98.0
Pfam v34.0
SMART v01_06_2016
SuperFamily v1.75
TIGRFAMs v15.0
IMGAP v5.1.x
The cmsearch and hmmsearch scripts will now create an empty output GFF file if no features got predicted.
Changed structural_annotation.sh so GFF files will only get added to gff_files_merger.py if they exist and are not empty.
Fixed typo in gff_files_merger.py argument list.
Now explicitly stating when Rfam did not predict any features.
Added set and pipefail commands to bash scripts.
Made the contig name parsing more robust in pick_best_genemark_predictions.sh for non-JGI setups.
Added finalize_fasta_files.py which filters all genes and proteins that are not in the final merged structural annotation GFF out of the CDS genes and proteins fasta files to create their final versions.
The final fasta files produced by the CDS prediction module now contain a 'cds_' in the filename and are the new basis (instead of the fasta files created by the GeneMark and Prodigal modules) for the final genes and proteins fasta files creation (via the newly added finalize_fasta_files.py).
Fixed a bug that caused the pipeline to keep running despite all sequences being filtered out during pre-QC.
Fixed a bug that caused duplicate start_type attributes in the structural/functional annotation GFFs.
Fixed a bug that in specific cases could cause an IndexError in the Contig.create_final_trusted_genes method of the cds_overlap_resolver.py script.
Carriage returns now get replaced with newline characters in the pre-QC step instead of being removed. Afterwards, consecutive newlines are collapsed into a single one.
The config file now contains a translation_table entry that can get used to force the usage of a specific genetic code for the CDS prediction. The value 0 will mimic the previous behavior in which the pipeline tries to auto-predict the correct translation table.
Removing the files created by GeneMark's translation table runs from the set of permanent output files.
Fasta files not starting with a '>' character fail the pre-QC step.
If an empty line in the fasta file is followed by a line not starting with '>' the pre-QC step fails.
Making sure that allowed overlapping CRISPR and CDS features are not starting at the same position.
The Rfam clan filter script now also removes the lower scoring of two overlapping hits to the same model, even if they are not in a clan.
Making sure GFF import RNAs also get a product name assignment.
Removed import of external python_toolbox and added tprint function to imgap_utils instead.
Completely rewrote CDS prediction and overlap resolution part of the structural annotation phase.
The filtering script for the LAST results now looks at the top n bitscores (instead of top n hits).
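The difference between "top n hits" and "top n bitscores" in the LAST filtering entry above can be made concrete with a small Python sketch. It assumes that "top n bitscores" means the n highest distinct bitscore values, so that all hits tied at one of those values are kept. The function names and the (subject, bitscore) tuple layout are hypothetical and not taken from the pipeline's filtering script.

```python
def top_n_hits(hits, n=5):
    """Previous behavior: keep the n best hits, ties cut off arbitrarily."""
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:n]


def hits_with_top_n_bitscores(hits, n=5):
    """New behavior (as assumed here): keep every hit whose bitscore is
    among the n highest distinct bitscore values."""
    distinct_scores = sorted({score for _, score in hits}, reverse=True)
    kept_scores = set(distinct_scores[:n])
    return [hit for hit in hits if hit[1] in kept_scores]


# Illustrative hits for one query: (subject ID, bitscore).
hits = [
    ("a", 100), ("b", 100), ("c", 90), ("d", 90),
    ("e", 80), ("f", 70), ("g", 60), ("h", 50),
]

by_hits = top_n_hits(hits, n=3)
by_scores = hits_with_top_n_bitscores(hits, n=3)
print([subject for subject, _ in by_hits])
print([subject for subject, _ in by_scores])
```

With n=3, the hit-based filter keeps three subjects and drops "d" even though it ties with "c", while the bitscore-based filter keeps all five subjects that share the three best scores.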
We will now combine the annotation database versions and the IMG Annotation Pipeline code version into a single designation called the IMG Annotation Version. Moving forward, there will be at most one major version update per year. These updates may include changes to the major or minor version number of the IMG Annotation Pipeline code (following semantic versioning) and/or updates to newer versions of the annotation databases.
Starting from Pipeline v5.0.0, isolate genomes and metagenomes are processed using the same code base. Moreover, additional functions (Cath-Funfam, SuperFamily and SMART) have been added.
Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8269246/
Highlight of IMG Annotation Pipeline v5.0.0: https://img.jgi.doe.gov/docs/pipelineV5/
Cath-Funfam v4.1.0/v4.2.0
COG v2014
IMG-NR v20190607
KEGG v77.1
Pfam v30
SMART v01_06_2016
SuperFamily v1.75
TIGRFAMs v15.0
IMGAP v5.0.x
If Prodigal and GeneMark predict the same gene, but it gets shortened in two different places, only the shortened GeneMark gene is kept.
The filtering script for the LAST results now makes an additional run over the output to check which subjects are actually needed in the MD5 lookup.
The GFF stats script also creates an additional output file in JSON format now.
Fixed bug that caused the Pfam e-value in the *_pfam.gff to always get set to 13.
The cmsearch command for the Rfam step can now also contain a -Z argument that gets read from the annotation config.
Pulling the /dev/null redirect out of the hmmsearch command variables (since it causes JAWS to fail).
Set the default max overlap ratio back to 0.1 for cath-funfam, cog, smart and superfamily.
Making sure that new logs don't get deleted when removing files from a potentially failed previous run.
Added print out of non-IUPAC characters if any get detected in the pre-QC step.
Added number of genes per 1M bp check to the post-QC step.
Increased the default minimum contig length from 150 to 200 bp.
Added check to make sure that the contig sequences only contain IUPAC letters before replacing all non-ACGTN characters with Ns.
The poly-N stretch length indicating a gap of unknown length can now be set via the config file. A 0 turns this feature off.
Commands to remove tmp and results files at the beginning of every step that uses GNU parallel (in case there was a previous run that got killed or failed due to non-pipeline related reasons).
The abortion of the pipeline processing at the encounter of poly-N stretches of length 100 can now be turned on and off via the config file.
The pre-QC step now removes leading/trailing Ns from the contigs' ends and checks for N-stretches of exactly 100 bp, which stand for gaps of unknown length. If such a gap exists the pipeline aborts processing and produces a GAPS_OF_UNKNOWN_LENGTH.txt file that lists the contig names and start positions of these gaps of unknown length.
The pre-QC step now creates a PRE_QC_FAILURE file in the input file directory if any of the QC rules were violated. The file lists the detailed reason.
For metagenomes the tRNA prediction now also uses B and A models instead of the general one.
Each hmmsearch module now executes multiple instances of hmmsearch in parallel. The number of parallel hmmsearch instances is now an entry in the yaml config file.
The -Z argument was added to the hmmsearch command lines. The value for the -Z argument is now also an entry in the config file.
The minimum contig length used by the pre-QC scripts now gets set via the annotation config yaml.
Additional Warning line handling in tRNAscan parser.
The locus_tag attribute from all GFF files.
The pre-QC and GFF and Fasta stats steps can now get turned on and off via the annotation config file.
Prodigal: If an isolate has fewer than 20000 bp, it gets processed in meta mode.
Fixed some typos.
Tracking of which module got started last (for in-house purposes, to catch datasets that time out on a specific module).
Fixed bug that in some cases caused duplicate genes, when one gene got shortened.
Fixed bug that in some cases caused SignalP or TMHMM results not to be present at the product name assignment step.
Added checkpoint information for every step of the pipeline, so that previously finished modules don't need to get executed again, when a job gets resubmitted.
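The contig sequence checks described in the pre-QC entries of this change log (IUPAC validation, masking of non-ACGTN characters, trimming of terminal Ns, and detection of poly-N stretches that mark gaps of unknown length) can be sketched in Python as below. This is an illustration under stated assumptions, not the pipeline's code: the function names, the restriction to IUPAC DNA codes, the 1-based start positions, and the order in which the checks are applied are all assumptions. The default gap length of 100 and the rule that 0 turns the check off follow the entries above.

```python
import re

# IUPAC DNA codes (assumption: RNA 'U' and gap symbols are not accepted).
IUPAC_DNA = set("ACGTRYSWKMBDHVN")


def find_non_iupac(sequence):
    """Return the sorted non-IUPAC characters found in the sequence."""
    return sorted(set(sequence.upper()) - IUPAC_DNA)


def mask_ambiguous(sequence):
    """Replace every character other than A, C, G, T, or N with N."""
    return re.sub(r"[^ACGTN]", "N", sequence.upper())


def strip_terminal_ns(sequence):
    """Remove leading and trailing Ns from a contig."""
    return sequence.strip("Nn")


def find_unknown_length_gaps(sequence, gap_length=100):
    """Return 1-based start positions of N-runs of exactly gap_length.

    A gap_length of 0 turns the check off.
    """
    if gap_length == 0:
        return []
    starts = []
    for match in re.finditer(r"N+", sequence.upper()):
        if match.end() - match.start() == gap_length:
            starts.append(match.start() + 1)
    return starts


# Illustrative contig: terminal Ns, one 100 bp gap, one short N-run.
contig = "NN" + "ACGT" + "N" * 100 + "ACRGT" + "N" * 5 + "TT" + "NN"

trimmed = strip_terminal_ns(contig)
gaps = find_unknown_length_gaps(trimmed)
print("non-IUPAC characters:", find_non_iupac(trimmed))
print("gap start positions:", gaps)
print("masked length:", len(mask_ambiguous(trimmed)))
```

Only runs of exactly the configured length are reported, so the 5 bp N-run in the example is treated as ordinary ambiguous sequence rather than as a gap of unknown length.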
New IMG Annotation Pipeline version that unifies isolate and metagenome processing.
Previously IMG used slightly different pipelines to process isolate genomes and metagenomes.
Publications: