Introduction to IMG video on YouTube: JGI's Integrated Microbial Genomes and Metagenomes (IMG) system
View our Webinar Series on YouTube.
For more information and Q & A see IMG Webinar.
Topics that were covered:
IMG Sequence Similarity Search (Blast)
Analysis Carts and Workspace
Statistical Analysis Tool
ANI (Average Nucleotide Identity)
Genome Search - Advanced Search Builder
IMG data export and download
Metagenome Bins
Data Submission & Management
IMG is free to use. You do not need an account to use public IMG/M, but public IMG has limited features and tools.
IMG/MER, which requires a JGI SSO account, has many more features: Workspace, new Statistical Analysis Tools, and more.
You can sign up for an account at: JGI Single Sign On (JGI SSO)
I'm stuck on the log in page.
Log out at JGI SSO https://signon.jgi.doe.gov/ and clear all of your browser's cookies and cache.
I created an account, but IMG said I already have an account. Please contact us.
See our glossary of terms used in IMG and GOLD.
We use Google services, including Analytics and reCAPTCHA.
Some ISPs' DNS servers block signon.jgi.doe.gov. Try using OpenDNS or Google's DNS servers.
You must allow 3rd party cookies and accept both https and http content.
Please see full system requirements here
Q: Is IMG available for local installation so that we could incorporate our own microbial genome data into IMG on our own servers?
A: IMG is not available for local installations. Resources required for making the IMG collection of tools and databases ready for installation at other sites are not available at this time.
Q: How is IMG different from other systems like PUMA, or SEED?
A: While IMG shares goals with systems such as PUMA and SEED (see IMG Lineage), IMG's main goal is to support the analysis of the genomes sequenced at JGI.
Q: What is "Finished", "Draft", "Permanent Draft" and so on?
A: Draft: It's incomplete and therefore a draft. More complete versions of genomes are likely to appear.
Finished: It's complete. There won't be new versions unless errors are found.
Permanent Draft: It's still incomplete, but it's pretty good already. There is no plan to further improve this version.
See section Carts and Workspace
IMG download requires a JGI SSO account.
See the IMG Webinar series "IMG data export and download" for how to download IMG data sets.
For other download documentation, please see the Download section of IMG Help.
Please see section Submission FAQ
How does Gene Detail page Conserved Neighborhood "Show neighborhood regions with the same top COG hit (via top homolog)" work?
The hits are from isolate blast all (top 500 hits) with the following filters:
All gene hits must have a COG annotation.
All self hits are removed.
Show only genome cart hits if it is not empty.
Remove any hits with a bit score > 10000.
Remove any hits with a percent identity < 50%.
Length filtering:
Remove hits with length > max length (where max length = query sequence length * 1.3)
Remove hits with length < min length (where min length = query sequence length * 0.7)
See Alternative to Gene Conserved Neighborhood to get Gene Neighborhoods without filtering.
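The filters listed above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not IMG's actual code: the `Hit` record, its field names, and the `filter_hits` function are hypothetical, and the input is assumed to already be limited to the top 500 BLAST hits.

```python
from dataclasses import dataclass
from typing import Collection, Iterable, List, Optional


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit (field names are assumptions)."""
    gene_id: str
    genome_id: str
    cog: Optional[str]       # COG annotation of the hit gene, None if absent
    bit_score: float
    percent_identity: float
    length: int              # length of the hit sequence


def filter_hits(query_id: str,
                query_length: int,
                hits: Iterable[Hit],
                genome_cart: Collection[str] = frozenset()) -> List[Hit]:
    """Apply the filters described in the text, in the order they are listed."""
    max_len = query_length * 1.3
    min_len = query_length * 0.7
    kept = []
    for h in hits:
        if h.cog is None:                 # hit must have a COG annotation
            continue
        if h.gene_id == query_id:         # drop self hits
            continue
        if genome_cart and h.genome_id not in genome_cart:
            continue                      # non-empty genome cart restricts hits
        if h.bit_score > 10000:           # bit score cutoff from the text
            continue
        if h.percent_identity < 50:       # percent identity cutoff
            continue
        if h.length > max_len or h.length < min_len:
            continue                      # length filtering
        kept.append(h)
    return kept
```

For a query of length 100, for example, only hits between 70 and 130 residues long pass the length filter.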
Question:
I compiled a fasta file using all the metagenomic protein sequences (available - according to the policy) followed by clustering with mmseqs at 70% identity.
I ended up with around 10B sequences, but on closer inspection, ~2B of these sequences have stop codons in them. e.g.
>Ga0114922_133083141
FDSTQDEEKTDKKAKKSPARSRVRVKVNNDWTKETVVTK*
Most of the problematic sequences (86%) only have a stop codon at the end, but some of them have multiple stop codons in the sequence.
I am trying to understand why this happens by understanding the JGI/IMG pipeline and the gene predictors Prodigal/GeneMark. However, I can't work out where the stop codons come from, especially since these sequences don't have fasta headers that resemble Prodigal output, e.g.
>Ga0494396_0041084_1_84 # Prodigal v2.6.3 # 1 # 84 # - # tt=11 # partial=3' # start_type=ATG
I am mainly interested in keeping valid sequences and discarding possibly wrong predictions.
Could you please offer me some context to the issue above?
Answer:
The issue is that the data were produced by probably five different pipeline versions. Since at least version 5, for Prodigal output we remove the trailing * and replace all other * within proteins with X.
There shouldn't be any stop codons in the data annotated with pipeline v5+.
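Users who want older datasets to follow the same convention can apply the rule themselves. The sketch below is not the pipeline's code; `normalize_translation` is a hypothetical helper, and it assumes that only a single trailing * is removed before the remaining * characters are replaced.

```python
def normalize_translation(protein: str) -> str:
    """Remove one trailing '*' and replace any remaining '*' with 'X'.

    Assumption: only a single trailing stop symbol is stripped; any other
    '*' (including a second one at the end) is treated as internal.
    """
    if protein.endswith("*"):
        protein = protein[:-1]
    return protein.replace("*", "X")


# The example sequence from the question loses only its trailing stop symbol.
example = "FDSTQDEEKTDKKAKKSPARSRVRVKVNNDWTKETVVTK*"
cleaned = normalize_translation(example)
```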
In older versions of the pipeline, translations with trailing stop codons were generated by Prodigal. The def lines in the fasta files don't look like what Prodigal normally produces because they are generated by the functional annotation step, which Prodigal doesn't perform.
In metagenomes annotated by older pipeline versions there can also be in-frame stop codons in the sequences. These occur in contigs that Prodigal annotated in meta mode with Mycoplasma as a model. These contigs could have genetic code 4 or 25; Prodigal can't distinguish between the two. These are still valid protein sequences, despite the presence of stop codons.
There is a very small set of really old metagenomes annotated by the very first version of the pipeline, which included FragGeneScan. These proteins can also have in-frame stop codons, but, unlike the rest of IMG annotations, most of these are hallucinations of FragGeneScan, and the sequences would be invalid.
Since the goal seems to be protein clustering, the user can get rid of these invalid sequences by discarding the singletons. Alternatively, just exclude any early datasets - most of them are small, anyway.
Question:
Can the output from Prodigal be stored directly as the output sequence (I can see the nicely formatted header https://github.com/hyattpd/prodigal/wiki/understanding-the-prodigal-output#protein-translations), or does it go through functional annotation steps in which the header is lost? In that scenario, what would be the correct way to categorize all sequence data into full proteins or fragments, especially for sequences whose headers have been lost and can't be used?
Answer:
Both "-" and "*" at the beginning of a protein are found in partial CDSs, in which the 1 or 2 nucleotides of an incomplete codon cannot be unambiguously translated into any amino acid.
WRT Prodigal output, there is a popular belief that Prodigal is capable of distinguishing between full-length proteins and protein fragments with perfect accuracy. Unfortunately, this is not the case. Prodigal will often call a protein fragment a full-length CDS (i.e. one with valid start and stop codons), simply because it finds an internal start codon that looks good enough. We ran this analysis a long time ago, and I don't have the exact count right now, but my recollection is that Prodigal called about 25% of partial CDSs full-length.
If the goal of your processing is to remove all potential protein fragments, the only way to achieve it is by dropping all CDSs on the edges of the contigs, regardless of whether Prodigal identified them as partial or not.
This should be done for two reasons: 1) Prodigal can't identify all partial proteins and 2) in metagenomes, where assembled contigs and scaffolds often have fairly low coverage, the likelihood of sequencing and assembly artifacts increases towards the ends of the contigs, so translations of CDSs at the edges of the contigs are more likely to be erroneous.
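Dropping edge CDSs needs gene coordinates and contig lengths, which would have to come from the annotation files (for example GFF) rather than from the protein fasta. The sketch below assumes those have already been parsed. The `Cds` record, the `drop_edge_cds` function, and the 3-nucleotide `margin` are illustrative assumptions, not IMG definitions of what counts as "on the edge".

```python
from typing import Dict, Iterable, List, NamedTuple


class Cds(NamedTuple):
    """Hypothetical CDS record with 1-based, inclusive coordinates."""
    gene_id: str
    contig_id: str
    start: int
    end: int


def drop_edge_cds(cds_list: Iterable[Cds],
                  contig_lengths: Dict[str, int],
                  margin: int = 3) -> List[Cds]:
    """Keep only CDSs that do not reach within `margin` nt of a contig end.

    The partial/complete flag from the gene caller is deliberately ignored.
    `margin` is an assumed tolerance; tune it to your own definition of edge.
    """
    kept = []
    for cds in cds_list:
        contig_len = contig_lengths[cds.contig_id]
        touches_start = cds.start <= margin
        touches_end = cds.end > contig_len - margin
        if not (touches_start or touches_end):
            kept.append(cds)
    return kept
```

With a 1000 nt contig and the default margin, a CDS spanning 400..700 is kept, while CDSs spanning 1..300 or 800..1000 are dropped.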
Generally speaking, metagenomic sequences are much noisier than those from isolate genomes, so it would make sense to cluster proteins at 90% identity or thereabouts, and throw out all singletons. This filtering would remove many, if not most, of the problematic sequences you were asking about, with the possible exception of stop codon-containing translations of proteins with genetic codes 4 and 25 annotated by Prodigal.
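Throwing out singletons can be done directly on the clustering result. The sketch below assumes a two-column, tab-separated cluster table (representative ID, then member ID, with each representative also listed as a member of its own cluster), which is the layout of the cluster TSV written by MMseqs2. The function name `non_singleton_clusters` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def non_singleton_clusters(tsv_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group members by representative and drop clusters with one member.

    Assumes each line is 'representative<TAB>member' and that a singleton
    cluster appears as a single line pairing the representative with itself.
    """
    clusters: Dict[str, List[str]] = defaultdict(list)
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue                      # skip blank lines
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return {rep: members for rep, members in clusters.items()
            if len(members) > 1}
```

In practice the lines would come from an open file handle, and the member IDs of the surviving clusters would then be used to select sequences from the fasta file.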