8 - A note on mining taxonomic information from GenBank.
BioPerl offers modules to easily allow taxonomic information to be mined from GenBank, but what happens when taxids are updated/deleted/merged by GenBank staff?
Taxids get retired when spelling mistakes are discovered and corrected, or when taxonomic information is revised and one name is subsumed into another. This is a natural process in taxonomy and database curation.
Sometimes, I like to retrieve the results of GenBank Taxonomy database searches in the form of a taxid list. This list may come from a direct query on the GenBank website, e.g. "Fungi"[ORGN] AND "phylum"[RANK]. Other times, it may come from a customized ebot Perl script written by Eric W. Sayers. Either way, the list may contain older 'retired' taxids that need to be taken into consideration.
Now that you have a list of taxids from whatever method you like, how can you easily retrieve the taxonomic information? Well, use a BioPerl module of course. The module that I use is Bio::LITE::Taxonomy::NCBI.
To use this module, you need to have already downloaded the taxonomic information from NCBI. The file I downloaded is called taxdump.tar.gz, available from NCBI's FTP site. Be sure to extract the archive into its own directory. The files specifically required by the module are names.dmp and nodes.dmp. An additional and very useful file is merged.dmp, and this is the file that I would like to bring to your attention.
Before using this module, make sure that you have the most recent versions of names.dmp, nodes.dmp, and merged.dmp. If you do not, you will not be able to retrieve taxonomic information for older taxids that have been deleted or merged into newer taxids.
Even though many parts of NCBI's databases are updated to use the new taxids, there will be times when you end up with an older taxid in your list. When this happens, you can use the merged.dmp file as a mapping file. The format looks something like this:
old taxid | new taxid |
and can be used in your Perl script to grab the new taxid so that the taxonomic information can be found.
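As a minimal sketch of that mapping step, the Perl below reads merged.dmp-style lines into a hash and swaps a retired taxid for its replacement. The sub names load_merged and resolve_taxid are hypothetical helpers of my own, not part of any module, and the taxids in the sample are made up so that the example runs without the real file. It assumes each line holds two numeric fields separated by "|" with surrounding whitespace, as in the format shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: read merged.dmp-style lines from a filehandle
# and return a hash reference mapping old taxid => new taxid.
sub load_merged {
    my ($fh) = @_;
    my %merged;
    while (my $line = <$fh>) {
        chomp $line;
        # Fields are separated by "|" padded with whitespace:
        # old taxid | new taxid |
        my ($old, $new) = split /\s*\|\s*/, $line;
        next unless defined $old && defined $new;
        next unless $old =~ /^\d+$/ && $new =~ /^\d+$/;
        $merged{$old} = $new;
    }
    return \%merged;
}

# Hypothetical helper: return the replacement taxid if the given one
# has been merged, otherwise return the taxid unchanged.
sub resolve_taxid {
    my ($merged, $taxid) = @_;
    return exists $merged->{$taxid} ? $merged->{$taxid} : $taxid;
}

# Made-up sample rows standing in for a real merged.dmp file.
my $sample = "111\t|\t222\t|\n333\t|\t444\t|\n";
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my $merged = load_merged($fh);
close $fh;

# 111 is "retired" in the sample; 999 is not listed at all.
for my $taxid (111, 999) {
    printf "%s -> %s\n", $taxid, resolve_taxid($merged, $taxid);
}
```

On the made-up sample this prints "111 -> 222" and "999 -> 999". In a real script you would open your own copy of merged.dmp instead of the in-memory string, and pass each resolved taxid on to the taxonomy lookup.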
Though this is not a bug in the module, it is an issue that I suspect happens often enough that users should be aware of the merged.dmp file and when it is useful.