Isolate README.txt
Inside each <taxon_oid>.tar.gz bundle file:
<taxon_oid>.fna - FASTA nucleic acid file for taxon.
<taxon_oid>.genes.fna - FASTA nucleic acid file for genes.
<taxon_oid>.genes.faa - FASTA amino acid file for genes.
<taxon_oid>.intergenic.fna - FASTA for intergenic regions.
<taxon_oid>.gff - Tab delimited in mostly GFF3 format for genes.
(Strict conformance is not guaranteed, especially in the type and attributes fields.)
<taxon_oid>.cog.tab.txt - Tab delimited file for COG annotation.
<taxon_oid>.kog.tab.txt - Tab delimited file for KOG annotation.
<taxon_oid>.pfam.tab.txt - Tab delimited file for Pfam annotation.
<taxon_oid>.tigrfam.tab.txt - Tab delimited file for TIGRFAM annotation.
<taxon_oid>.ipr.tab.txt - Tab delimited file for "other" (non-Pfam/TIGRFAM) InterPro hits (optional).
<taxon_oid>.ko.txt - Tab delimited file for KO and EC annotation.
<taxon_oid>.signalp.txt - Tab delimited file for signal peptide annotation.
<taxon_oid>.tmhmm.txt - Tab delimited file for transmembrane helices.
<taxon_oid>.xref.tab.txt - Tab delimited file for external references.
(Data is spotty and optional.)
<taxon_oid>.crispr.txt - Tab delimited file for CRISPR details.
(For some of the smaller genomes, e.g., viruses, not all
annotation files are present. If the genome has no
annotation of a certain type, e.g., no TIGRFAMs,
the annotation file <taxon_oid>.tigrfam.tab.txt is absent.
Some annotations are not done at all, e.g., InterPro for metagenomes.
Any file that would contain no annotations or no data is
omitted from the bundle.)
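Because absent annotation types are simply left out of the bundle, it can help to check what a bundle contains before parsing it. The following is a minimal Python sketch, assuming bundle members are named <taxon_oid>.<suffix>, possibly inside a directory. The function names and the demo taxon_oid 12345 are hypothetical. Both the .txt and .tab.txt spellings are accepted for the KO, SignalP and TMHMM files, since this README uses both.

```python
import io
import tarfile

# Annotation kinds and the file suffixes this README lists for them.
EXPECTED = {
    "genome": ("fna",),
    "genes_nt": ("genes.fna",),
    "genes_aa": ("genes.faa",),
    "intergenic": ("intergenic.fna",),
    "gff": ("gff",),
    "cog": ("cog.tab.txt",),
    "kog": ("kog.tab.txt",),
    "pfam": ("pfam.tab.txt",),
    "tigrfam": ("tigrfam.tab.txt",),
    "ipr": ("ipr.tab.txt",),
    "ko": ("ko.txt", "ko.tab.txt"),
    "signalp": ("signalp.txt", "signalp.tab.txt"),
    "tmhmm": ("tmhmm.txt", "tmhmm.tab.txt"),
    "xref": ("xref.tab.txt",),
    "crispr": ("crispr.txt",),
}


def bundle_contents(tar, taxon_oid):
    """Map each annotation kind to True/False for an open tarfile."""
    prefix = taxon_oid + "."
    suffixes = set()
    for name in tar.getnames():
        base = name.rsplit("/", 1)[-1]  # drop any directory component
        if base.startswith(prefix):
            suffixes.add(base[len(prefix):])
    return {kind: any(s in suffixes for s in alternatives)
            for kind, alternatives in EXPECTED.items()}


def make_demo_bundle(member_names):
    """Build a tiny in-memory .tar.gz with empty members (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in member_names:
            tar.addfile(tarfile.TarInfo(name))  # zero-length member
    buf.seek(0)
    return buf


if __name__ == "__main__":
    demo = make_demo_bundle(["12345/12345.fna", "12345/12345.gff"])
    with tarfile.open(fileobj=demo, mode="r:gz") as tar:
        contents = bundle_contents(tar, "12345")
    print("absent:", [kind for kind, ok in contents.items() if not ok])
```

For a real bundle, open it with tarfile.open("<taxon_oid>.tar.gz", "r:gz") and pass the taxon_oid as a string.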
------------
Structure of each tab delimited file:
<taxon_oid>.gff
-- seqid - Sequence ID
-- source - version of IMG database
-- type - feature type
-- start_coord - starting coordinate
-- end_coord - ending coordinate
-- score - NA
-- strand
-- phase - NA
-- attributes - ID=<gene_oid>;locus_tag=<locus_tag>;product=<product name>
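The nine GFF columns above can be read with a few lines of Python. This is a sketch rather than a validating GFF3 parser: it assumes tab-separated fields and a key=value;key=value attributes column as described above, and the sample line is made up for illustration. A product name that itself contains a semicolon would be split incorrectly.

```python
GFF_COLUMNS = [
    "seqid", "source", "type", "start_coord", "end_coord",
    "score", "strand", "phase", "attributes",
]


def parse_attributes(text):
    """Split 'key=value;key=value' into a dict; tolerate odd entries."""
    attrs = {}
    for part in text.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        attrs[key] = value if sep else ""
    return attrs


def parse_gff_line(line):
    """Return a dict for one feature line, or None for comment, blank
    or malformed lines."""
    if line.startswith("#") or not line.strip():
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(GFF_COLUMNS):
        return None
    record = dict(zip(GFF_COLUMNS, fields))
    record["start_coord"] = int(record["start_coord"])
    record["end_coord"] = int(record["end_coord"])
    record["attributes"] = parse_attributes(record["attributes"])
    return record


# Made-up line for illustration only; not taken from a real bundle.
SAMPLE = ("scaffold_1\tIMG_demo\tCDS\t10\t300\t.\t+\t0\t"
          "ID=100;locus_tag=ABC_0001;product=hypothetical protein")

if __name__ == "__main__":
    feature = parse_gff_line(SAMPLE)
    print(feature["attributes"]["locus_tag"], feature["start_coord"])
```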
<taxon_oid>.cog.tab.txt (from NCBI RPSBLAST)
-- gene_oid - Gene object identifier of query gene
-- gene_length - Length of protein sequence
-- percent_identity - Percent identity of aligned amino acid residues
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- cog_id - COG identifier
-- cog_name - COG name
-- cog_length - Length of COG consensus sequence
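The column list above maps directly onto a small reader. The sketch below assumes the columns appear in the listed order; whether the file carries a header row is not stated here, so a row starting with "gene_oid" is skipped if present. The sample row is made up, and query_coverage is an illustrative derived value (1-based, inclusive coordinates assumed), not a column of the file.

```python
import csv
import io

COG_COLUMNS = [
    "gene_oid", "gene_length", "percent_identity", "query_start",
    "query_end", "subj_start", "subj_end", "evalue", "bit_score",
    "cog_id", "cog_name", "cog_length",
]
INT_COLUMNS = {"gene_length", "query_start", "query_end",
               "subj_start", "subj_end", "cog_length"}
FLOAT_COLUMNS = {"percent_identity", "evalue", "bit_score"}


def read_cog_rows(handle):
    """Yield one dict per hit with numeric columns converted."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip malformed rows and an optional header row.
        if len(fields) != len(COG_COLUMNS) or fields[0] == "gene_oid":
            continue
        row = dict(zip(COG_COLUMNS, fields))
        for name in INT_COLUMNS:
            row[name] = int(row[name])
        for name in FLOAT_COLUMNS:
            row[name] = float(row[name])
        yield row


def query_coverage(row):
    """Fraction of the query protein covered by the alignment."""
    return (row["query_end"] - row["query_start"] + 1) / row["gene_length"]


# Made-up row for illustration only.
SAMPLE = "100\t200\t45.5\t1\t100\t5\t104\t1e-30\t120.5\tCOG0001\tdemo name\t400\n"

if __name__ == "__main__":
    for hit in read_cog_rows(io.StringIO(SAMPLE)):
        print(hit["gene_oid"], hit["cog_id"], query_coverage(hit))
```

The same pattern should carry over to the KOG and KO tables by swapping in their column lists.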
<taxon_oid>.kog.tab.txt (from NCBI RPSBLAST)
-- gene_oid - Gene object identifier of query gene
-- gene_length - Length of protein sequence
-- percent_identity - Percent identity of aligned amino acid residues
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- kog_id - KOG identifier
-- kog_name - KOG name
-- kog_length - Length of KOG consensus sequence
<taxon_oid>.pfam.tab.txt (from EBI's pfam_scan which uses HMMER 3.0)
-- gene_oid - Gene object identifier of query gene
-- gene_length - Length of protein sequence
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- pfam_id - Pfam identifier
-- pfam_name - Pfam name
-- pfam_length - Length of Pfam consensus sequence
<taxon_oid>.tigrfam.tab.txt (from hmmscan HMMER3.0)
-- gene_oid - Gene object identifier of query gene
-- gene_length - Length of protein sequence
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- tigrfam_id - TIGRFAM identifier
-- tigrfam_name - TIGRFAM name
<taxon_oid>.ipr.tab.txt
-- gene_oid - Gene object identifier of query gene
-- gene_length - Length of protein sequence
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- domaindb - Original domain database.
-- domainid - ID on original domain database.
-- iprid - InterPro ID.
-- iprdesc - InterPro description.
-- go_info - Gene Ontology Information
<taxon_oid>.ko.tab.txt (from NCBI BLASTP on KEGG genes)
-- gene_oid - Gene object identifier of query gene
-- gene_length - Length of protein sequence
-- percent_identity - Percent identity of aligned amino acid residues
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- ko_id - KEGG Orthology (KO) identifier
-- ko_name - KO name
-- EC - Enzyme Commission (EC) assignment from KO
-- img_ko_flag - 'Yes' (assigned by IMG pipeline); 'No' - from KEGG.
<taxon_oid>.signalp.tab.txt (SignalP)
-- gene_oid - Gene object identifier of query gene
-- gene_length - Length of protein sequence
-- feature_type - "cleavage"
-- start_coord - start coordinate of feature
-- end_coord - end coordinate of feature
<taxon_oid>.tmhmm.tab.txt (TMHMM)
-- gene_oid - Gene object identifier of query gene
-- gene_length - Length of protein sequence
-- feature_type - feature
-- start_coord - start coordinate of feature
-- end_coord - end coordinate of feature
<taxon_oid>.xref.tab.txt
-- gene_oid - Gene object identifier of query gene
-- db_name - External database
-- id - External ID corresponding to database
<taxon_oid>.crispr.txt (optional)
-- contig_id - Contig/Scaffold ID
-- crispr_no - CRISPR number
-- pos - Starting position of array element
-- repeat_seq - Repeat sequence
-- spacer_seq - Spacer sequence
-- tool_code - Single letter code for tool used
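Since each row of the CRISPR file describes one array element, rows usually need to be grouped back into arrays. A possible sketch, assuming one element per line with the six columns above, and assuming that a trailing empty spacer_seq may be missing from the last element of an array. The sample rows and tool codes are made up.

```python
from collections import defaultdict

CRISPR_COLUMNS = ["contig_id", "crispr_no", "pos",
                  "repeat_seq", "spacer_seq", "tool_code"]


def group_crispr_arrays(lines):
    """Group array elements by (contig_id, crispr_no), sorted by pos."""
    arrays = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank/short lines and an optional header row.
        if len(fields) < 3 or fields[0] == "contig_id":
            continue
        # Pad rows whose trailing columns are empty or absent.
        fields += [""] * (len(CRISPR_COLUMNS) - len(fields))
        row = dict(zip(CRISPR_COLUMNS, fields))
        row["pos"] = int(row["pos"])
        arrays[(row["contig_id"], row["crispr_no"])].append(row)
    for elements in arrays.values():
        elements.sort(key=lambda r: r["pos"])
    return dict(arrays)


# Made-up rows for illustration only.
SAMPLE = [
    "scaffold_1\t1\t500\tGTTT\tACGT\tC\n",
    "scaffold_1\t1\t100\tGTTT\tTTGA\tC\n",
    "scaffold_2\t1\t40\tCCCC\t\tP\n",
]

if __name__ == "__main__":
    for key, elements in group_crispr_arrays(SAMPLE).items():
        print(key, [e["pos"] for e in elements])
```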
IMG tar file example 3300048887.tar.gz
README.txt
IMG Pipeline v.5.0 will include the following files available for downloading from JGI Genome Portal. All files are compressed into a single downloadable file named <taxon_oid>.tar.gz
The content below will be added as a file called 00.bundle.README.txt:
<taxon_oid> - Corresponds to taxon object identifier.
Inside each <taxon_oid>.tar.gz bundle file:
<au> is either "a"ssembled or "u"nassembled.
<taxon_oid>.<au>.fna - FASTA nucleic acid file for taxon.
<taxon_oid>.<au>.faa - FASTA amino acid file for taxon.
<taxon_oid>.<au>.gff - GFF3 format file with annotation
<taxon_oid>.<au>.cog.txt - Tab delimited file for COG annotation.
<taxon_oid>.<au>.cog.hmmout - Raw output (domtblout) of hmmsearch with COG HMMs
<taxon_oid>.<au>.pfam.txt - Tab delimited file for Pfam annotation.
<taxon_oid>.<au>.pfam.hmmout - Raw output (domtblout) of hmmsearch with Pfam HMMs
<taxon_oid>.<au>.tigr.txt - Tab delimited file for TIGRFAM annotation.
<taxon_oid>.<au>.tigr.hmmout - Raw output (domtblout) of hmmsearch with TIGRfam HMMs
<taxon_oid>.<au>.cathfunfam.txt - Tab delimited file for CATH FUNFAM annotation.
<taxon_oid>.<au>.cathfunfam.hmmout - Raw output (domtblout) of hmmsearch with CATH/FunFam HMMs
<taxon_oid>.<au>.supfam.txt - Tab delimited file for SUPERFAM annotation.
<taxon_oid>.<au>.supfam.hmmout - Raw output (domtblout) of hmmsearch with SupFam HMMs
<taxon_oid>.<au>.smart.txt - Tab delimited file for SMART annotation.
<taxon_oid>.<au>.smart.hmmout - Raw output (domtblout) of hmmsearch with SMART HMMs
<taxon_oid>.<au>.rfam.txt - Tab delimited file for non-coding RNA and regulatory RNA motif and binding site annotation.
<taxon_oid>.<au>.phylodist.txt - Tab delimited file for Phylo Distribution (best LAST hits against non-redundant protein database derived from high-quality IMG genomes).
<taxon_oid>.<au>.ko.txt - Tab delimited file for KO annotation.
<taxon_oid>.<au>.ec.txt - Tab delimited file for EC annotation.
<taxon_oid>.<au>.gene_product.txt - Tab-delimited file with protein product name assignments.
<taxon_oid>.<au>.depth.txt - Tab-delimited file with average per-contig read depth (optional, available only for some metagenome and metatranscriptome datasets).
<taxon_oid>.<au>.map.txt - Tab-delimited file with mapping of original contig/read IDs (headers of submitted fasta file) to IMG contig names (optional).
<taxon_oid>.<au>.crispr.txt - Tab-delimited file for CRISPR array annotation details (optional).
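The <taxon_oid>.<au>.<type> naming scheme above can be taken apart with a regular expression. In this sketch the pattern assumes a purely numeric taxon_oid (as in the example 3300048887); that assumption and the function name are not part of the bundle specification.

```python
import re

# <taxon_oid>.<au>.<rest>, where <au> is "a" or "u".
BUNDLE_NAME = re.compile(r"^(?P<taxon_oid>\d+)\.(?P<au>[au])\.(?P<kind>.+)$")
AU_LABELS = {"a": "assembled", "u": "unassembled"}


def parse_bundle_name(filename):
    """Split a bundle member name into its parts, or return None."""
    match = BUNDLE_NAME.match(filename)
    if match is None:
        return None
    return {
        "taxon_oid": match["taxon_oid"],
        "au": AU_LABELS[match["au"]],
        "kind": match["kind"],  # e.g. "gff", "pfam.txt", "pfam.hmmout"
    }


if __name__ == "__main__":
    print(parse_bundle_name("3300048887.a.pfam.hmmout"))
    print(parse_bundle_name("README.txt"))
```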
------------
Structure of a .gff file and tab-delimited text files:
<taxon_oid>.<au>.gff
-- seqid - Sequence ID
-- source - version of IMG database
-- type - feature type
-- start_coord - starting coordinate
-- end_coord - ending coordinate
-- score - NA
-- strand
-- phase - NA
-- attributes - ID=<feature_id>;locus_tag=<gene_id>;product=<initial product>
<taxon_oid>.<au>.cog.txt (from NCBI RPSBLAST or hmmsearch with COG HMMs)
-- gene_id - Gene object identifier of query gene
-- cog_id - COG identifier
-- percent_identity - Percent identity of aligned amino acid residues (Not valid for HMMs, retained for compatibility with legacy data)
-- align_length - Alignment length
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
<taxon_oid>.<au>.pfam.txt (from hmmsearch with Pfam HMMs)
-- gene_id - Gene identifier of query gene
-- pfam_id - Pfam identifier
-- percent_identity - (Always "100%". Not valid for HMMs, retained for compatibility with legacy data)
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<taxon_oid>.<au>.tigr.txt (TIGRFAM annotation: optional)
-- gene_id - Gene identifier of query gene
-- tfam_id - TIGRFAM identifier
-- percent_identity - (Always "100%". Not valid for HMMs, retained for compatibility with legacy data)
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<taxon_oid>.<au>.cathfunfam.txt (CATH FUNFAM annotation)
-- gene_id - Gene identifier of query gene
-- cathfunfam_id - CATH FUNFAM identifier
-- percent_identity - Percent identity match in alignment (Not valid for HMMs, retained for compatibility with legacy data)
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<taxon_oid>.<au>.supfam.txt (SUPERFAM annotation)
-- gene_id - Gene identifier of query gene
-- superfam_id - SUPERFAM identifier
-- percent_identity - Percent identity match in alignment (Not valid for HMMs, retained for compatibility with legacy data)
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<taxon_oid>.<au>.smart.txt (SMART annotation)
-- gene_id - Gene identifier of query gene
-- smart_id - SMART identifier
-- percent_identity - Percent identity match in alignment (Not valid for HMMs, retained for compatibility with legacy data)
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<taxon_oid>.<au>.phylodist.txt (from LAST on non-redundant database of IMG proteins extracted from high-quality genomes)
-- gene_id - Gene identifier of query gene
-- homolog_gene_oid - IMG gene object identifier of LAST hit (subject sequence)
-- homolog_taxon_oid - IMG taxon object identifier of LAST hit protein (subject sequence)
-- percent_identity - Percent identity match in alignment
-- lineage - domain;phylum;class;order;family;genus;species;taxon_name of the genome in which LAST hit was found
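The lineage column packs eight semicolon-separated levels into one field. A small sketch for unpacking it, assuming the five columns appear in the order listed above. The split is limited to seven separators so that any extra semicolons stay inside taxon_name. The sample row and its identifiers are made up.

```python
RANKS = ["domain", "phylum", "class", "order",
         "family", "genus", "species", "taxon_name"]
PHYLODIST_COLUMNS = ["gene_id", "homolog_gene_oid", "homolog_taxon_oid",
                     "percent_identity", "lineage"]


def split_lineage(lineage):
    """Map rank names to the semicolon-separated lineage levels."""
    parts = lineage.split(";", len(RANKS) - 1)
    return dict(zip(RANKS, parts))


def parse_phylodist_line(line):
    """Return a dict for one hit, or None for header/malformed lines."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(PHYLODIST_COLUMNS) or fields[0] == "gene_id":
        return None
    row = dict(zip(PHYLODIST_COLUMNS, fields))
    row["percent_identity"] = float(row["percent_identity"])
    row["lineage"] = split_lineage(row["lineage"])
    return row


# Made-up row for illustration only.
SAMPLE = ("Ga0000_0001\t640000001\t640000000\t98.5\t"
          "Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;"
          "Enterobacteriaceae;Escherichia;Escherichia coli;"
          "Escherichia coli K-12")

if __name__ == "__main__":
    hit = parse_phylodist_line(SAMPLE)
    print(hit["lineage"]["phylum"], hit["percent_identity"])
```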
<taxon_oid>.<au>.ko.txt (from LAST on IMG genes)
-- gene_id - Gene object identifier of query gene
-- img_ko_flag - (IMG generated KO assignment. Always 'Yes'.)
-- ko_term - KEGG Orthology (KO) identifier of LAST hit (subject sequence)
-- percent_identity - Percent identity of aligned amino acid residues
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
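One common use of the KO table is a simple count of genes per KO term. The sketch below assumes the eleven columns appear in the order listed above and counts each (gene_id, ko_term) pair once; the counting itself is an illustration, not something the pipeline produces. The sample rows and the KO term spelling are made up.

```python
from collections import Counter

KO_COLUMNS = [
    "gene_id", "img_ko_flag", "ko_term", "percent_identity",
    "query_start", "query_end", "subj_start", "subj_end",
    "evalue", "bit_score", "align_length",
]


def genes_per_ko(lines):
    """Count distinct genes assigned to each KO term."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Skip malformed rows and an optional header row.
        if len(fields) != len(KO_COLUMNS) or fields[0] == "gene_id":
            continue
        row = dict(zip(KO_COLUMNS, fields))
        pairs.add((row["gene_id"], row["ko_term"]))
    return Counter(ko for _gene, ko in pairs)


# Made-up rows for illustration only.
SAMPLE = [
    "gene_1\tYes\tKO:K00001\t55.0\t1\t100\t1\t100\t1e-20\t90.0\t100\n",
    "gene_1\tYes\tKO:K00001\t50.0\t5\t90\t5\t90\t1e-10\t60.0\t86\n",
    "gene_2\tYes\tKO:K00001\t70.0\t1\t120\t1\t120\t1e-40\t150.0\t120\n",
    "gene_2\tYes\tKO:K00002\t40.0\t1\t80\t3\t82\t1e-05\t45.0\t80\n",
]

if __name__ == "__main__":
    for ko, count in sorted(genes_per_ko(SAMPLE).items()):
        print(ko, count)
```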
<taxon_oid>.<au>.ec.txt (from LAST on IMG genes)
-- gene_id - Gene object identifier of query gene
-- img_ko_flag - (IMG generated KO assignment. Always 'Yes'.)
-- EC - EC derived from KEGG Orthology (KO) identifier of LAST hit (subject sequence)
-- percent_identity - Percent identity of aligned amino acid residues
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<taxon_oid>.<au>.gene_product.txt (from COG, Pfam, TIGRfam)
-- gene_id - Gene identifier
-- product_name - Product name
-- source - Source of assignment
<taxon_oid>.<au>.depth.txt (optional)
-- orig_id - Original or current contig ID
-- depth - average per-contig read depth
-- length - length of contig (optional)
-- ref_gc - GC content of contig (optional)
-- base_cov - Percent of contig with coverage (optional)
<taxon_oid>.<au>.map.txt (optional)
-- orig_id - Original sequence ID (derived from the headers of fasta file submitted to IMG)
-- new_id - New sequence ID assigned by IMG annotation pipeline
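The map file makes it possible to report read depth under IMG contig names. A minimal sketch, assuming the first two columns of each file are as listed above and that depth rows may use either the original or the IMG ID; IDs missing from the map are kept unchanged. All identifiers in the sample data are made up.

```python
def read_id_map(lines):
    """Build {orig_id: new_id} from the lines of a map.txt file."""
    mapping = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or fields[0] == "orig_id":
            continue
        mapping[fields[0]] = fields[1]
    return mapping


def depth_by_img_id(depth_lines, id_map):
    """Build {IMG contig ID: depth} from the lines of a depth.txt file."""
    depths = {}
    for line in depth_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 2 or fields[0] == "orig_id":
            continue
        # Fall back to the ID itself when it is already an IMG ID.
        contig = id_map.get(fields[0], fields[0])
        depths[contig] = float(fields[1])
    return depths


# Made-up data for illustration only.
MAP_SAMPLE = ["NODE_1\tGa0000_1001\n", "NODE_2\tGa0000_1002\n"]
DEPTH_SAMPLE = ["NODE_1\t12.5\t5000\n", "Ga0000_1002\t3.0\n"]

if __name__ == "__main__":
    id_map = read_id_map(MAP_SAMPLE)
    print(depth_by_img_id(DEPTH_SAMPLE, id_map))
```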
<taxon_oid>.<au>.crispr.txt (optional)
-- contig_id - Contig/Scaffold ID
-- crispr_no - CRISPR number
-- pos - Starting position of array element
-- repeat_seq - Repeat sequence
-- spacer_seq - Spacer sequence
-- tool_code - Single letter code for tool used
Individual Bin Data
<bin_oid> - Corresponds to metagenome bin object identifier.
Inside each <bin_oid>.tar.gz bundle file:
<bin_oid>.fna - FASTA nucleic acid file of scaffolds for genome bin.
<bin_oid>.faa - FASTA amino acid file for genome bin.
<bin_oid>.gff - Tab delimited GFF3 format for genes.
<bin_oid>.cog.txt - Tab delimited file for COG annotation.
<bin_oid>.pfam.txt - Tab delimited file for Pfam annotation.
<bin_oid>.tigr.txt - Tab delimited file for TIGRFAM annotation (optional).
<bin_oid>.phylodist.txt - Tab delimited file for phylo distribution.
<bin_oid>.ko.txt - Tab delimited file for KO annotation.
<bin_oid>.ec.txt - Tab delimited file for EC annotation.
<bin_oid>.gene_product.txt - Product name assignment.
<bin_oid>.crispr.txt - Tab delimited file for CRISPR details.
------------
Structure of each tab delimited file:
<bin_oid>.gff
-- seqid - Sequence ID
-- source - version of IMG database
-- type - feature type
-- start_coord - starting coordinate
-- end_coord - ending coordinate
-- score - NA
-- strand
-- phase - NA
-- attributes - ID=<feature_id>;locus_tag=<gene_id>;product=<initial product>
<bin_oid>.cog.txt (from NCBI RPSBLAST)
-- gene_id - Gene object identifier of query gene
-- cog_id - COG identifier
-- percent_identity - Percent identity of aligned amino acid residues
-- align_length - Alignment length
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
<bin_oid>.pfam.txt (from EBI's pfam_scan which uses HMMER 3.0)
-- gene_id - Gene identifier of query gene
-- pfam_id - Pfam identifier
-- percent_identity - (Always "100%". Not valid for HMMs.)
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<bin_oid>.tigr.txt (TIGRFAM annotation: optional)
-- gene_id - Gene identifier of query gene
-- tfam_id - TIGRFAM identifier
-- percent_identity - (Always "100%". Not valid for HMMs.)
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<bin_oid>.phylodist.txt (from LAST on IMG genes)
-- gene_id - Gene identifier of query gene
-- homolog_gene_oid - IMG homolog gene object identifier
-- homolog_taxon_oid - IMG taxon object identifier
-- percent_identity - Percent identity match in alignment
-- lineage - domain;phylum;class;order;family;genus;species;taxon_name
<bin_oid>.ko.txt (from LAST on IMG genes)
-- gene_id - Gene object identifier of query gene
-- img_ko_flag - (IMG generated KO assignment. Always 'Yes'.)
-- ko_term - KEGG Orthology (KO) identifier
-- percent_identity - Percent identity of aligned amino acid residues
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<bin_oid>.ec.txt (from LAST on IMG genes)
-- gene_id - Gene object identifier of query gene
-- img_ko_flag - (IMG generated KO assignment. Always 'Yes'.)
-- EC - EC derived from KEGG Orthology (KO) identifier
-- percent_identity - Percent identity of aligned amino acid residues
-- query_start - Start coordinate of alignment on query gene
-- query_end - End coordinate of alignment on query gene
-- subj_start - Start coordinate of alignment on subject sequence
-- subj_end - End coordinate of alignment on subject sequence
-- evalue - Expectation value
-- bit_score - Bit score of alignment
-- align_length - Alignment length
<bin_oid>.gene_product.txt (from COG, Pfam, TIGRfam)
-- gene_id - Gene identifier
-- product_name - Product name
-- source - Source of assignment
<bin_oid>.crispr.txt (optional)
-- contig_id - Contig/Scaffold ID
-- crispr_no - CRISPR number
-- pos - Starting position of array element
-- repeat_seq - Repeat sequence
-- spacer_seq - Spacer sequence
-- tool_code - Single letter code for tool used
===============================
There is also a mbin_datafile_<taxon_oid> file for all MQ and HQ bins extracted from this metagenome.
-- IMG Bin ID - Metagenome bin OID
-- Bin Quality - Bin quality; MQ: medium quality; HQ: high quality
-- Bin Lineage - Bin lineage determined by IMG scaffold lineage
-- GTDB-TK lineage - Bin lineage determined by GTDB-Tk
-- Bin Completeness - Bin completeness %
-- Bin Contamination - Bin contamination %
-- Total Number of Bases - Total number of bases in the bin
-- Number of genes - Total number of genes in the bin
-- Num of 5s rRNA - Total number of 5S rRNA genes in the bin
-- Num of 16s rRNA - Total number of 16S rRNA genes in the bin
-- Num of 23s rRNA - Total number of 23S rRNA genes in the bin
-- Num of tRNA - Total number of tRNA genes in the bin
-- IMG scaffold IDs of members - OIDs of all IMG scaffolds in the bin
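The bin table can be filtered by quality and completeness with the csv module. This sketch rests on two assumptions that are not confirmed here: that the file is tab delimited, and that it starts with a header row whose names match the field names listed above. Only three of those fields are used. The sample rows are made up.

```python
import csv
import io


def read_bins(handle):
    """Read the bin table into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def high_quality_bins(rows, min_completeness=0.0):
    """Return IMG Bin IDs of HQ bins at or above min_completeness (%)."""
    selected = []
    for row in rows:
        if (row.get("Bin Quality") or "").strip() != "HQ":
            continue
        text = (row.get("Bin Completeness") or "").strip().rstrip("%")
        try:
            completeness = float(text)
        except ValueError:
            continue  # missing or non-numeric completeness
        if completeness >= min_completeness:
            selected.append(row["IMG Bin ID"])
    return selected


# Made-up table for illustration only.
SAMPLE = (
    "IMG Bin ID\tBin Quality\tBin Completeness\n"
    "3300000000_1\tHQ\t97.5\n"
    "3300000000_2\tMQ\t75.0\n"
    "3300000000_3\tHQ\t91.0\n"
)

if __name__ == "__main__":
    bins = read_bins(io.StringIO(SAMPLE))
    print(high_quality_bins(bins, min_completeness=95.0))
```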