Paper 10

New Developments in the Kyoto Encyclopedia of Genes and Genomes

The Kyoto Encyclopedia of Genes and Genomes (KEGG) has been cataloguing data on genes, proteins, and pathways for more than twenty years1. In that time, its databases have amassed a wealth of information, supported by numerous research tools and by integration into the still more expansive GenomeNet2. Some of the most exciting recent developments include orthology algorithms and the application of the databases, particularly PATHWAY and the health-related ones, to the study of diseases, including cancer1. Scientists now routinely consult KEGG in their molecular biology research.

KEGG was developed in 1995 as part of the Human Genome Program of the Ministry of Education, Science, Sports and Culture in Japan, with the goal of aggregating and computerizing information on important genes, proteins, and pathways, as well as the relationships between them3. At that time, the Human Genome Project was well underway, having begun in the United States during the mid-eighties4. The original KEGG had only four databases: GENES, PATHWAY, COMPOUND, and ENZYME1. Interactions between genes and molecules were represented as binary relations, which describe relationships between pairs of entities. Genes, proteins, and molecules were classified into hierarchies based on their degree of similarity5. Links to other resources for cross-referencing purposes, including the BLAST and FASTA search tools and the SWISS-PROT database, were obtained using DBGET and LinkDB2,6. Pathways were generated and curated manually, and if an enzyme was missing, alternative pathways for a particular reaction could be generated5. The earliest and easiest pathways to incorporate into KEGG were metabolic pathways, because they were highly conserved between organisms3.
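
The binary-relation idea described above can be illustrated with a small sketch. This is not KEGG's implementation: the compound names, the enzyme labels, and the function `find_routes` are all hypothetical, chosen only for illustration. Each reaction is stored as a (substrate, product) pair labeled with an enzyme, and routes between two compounds are enumerated while skipping any enzyme marked as missing.

```python
from collections import defaultdict

# Hypothetical toy network: each binary relation is (substrate, product, enzyme).
RELATIONS = [
    ("A", "B", "E1"),
    ("B", "D", "E2"),
    ("A", "C", "E3"),
    ("C", "D", "E4"),
]


def find_routes(relations, start, goal, missing=frozenset()):
    """Return every simple route from start to goal as a list of enzyme labels.

    Relations whose enzyme appears in `missing` are ignored, so the remaining
    routes are the alternatives that avoid the missing enzymes.
    """
    # Build an adjacency list from the usable binary relations.
    graph = defaultdict(list)
    for substrate, product, enzyme in relations:
        if enzyme not in missing:
            graph[substrate].append((product, enzyme))

    routes = []
    # Depth-first search; each stack entry tracks the compounds already visited
    # so that no route passes through the same compound twice.
    stack = [(start, [], {start})]
    while stack:
        node, enzymes, seen = stack.pop()
        if node == goal:
            routes.append(enzymes)
            continue
        for nxt, enzyme in graph[node]:
            if nxt not in seen:
                stack.append((nxt, enzymes + [enzyme], seen | {nxt}))
    return sorted(routes)


if __name__ == "__main__":
    print(find_routes(RELATIONS, "A", "D"))
    print(find_routes(RELATIONS, "A", "D", missing={"E2"}))
```

In this toy network, marking E2 as missing leaves only the route through E3 and E4, which mirrors in a very simplified way how an alternative pathway might be proposed when an enzyme is absent.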

Around the turn of the millennium, researchers began to take more interest in KEGG. While papers on KEGG and its functionality published in 1997 or 1998 received only a few hundred citations at most5,7,8, the updated report published in 1999 garnered almost three thousand citations9, and the report from 2000 was cited more than nine thousand times3. The announcement of the first draft of the human genome in 2000 might have contributed to this upswing in interest4. Around this time, KEGG was updated with ortholog tables3,9, and the COMPOUND and ENZYME databases were consolidated into the LIGAND database, which also included REACTION, a database designed to track enzyme-substrate reactions10. By the mid-2000s, KEGG had four major databases: PATHWAY, GENES, LIGAND, and BRITE, the last of which focused mostly on ontology and the organization of information. One of the components of BRITE was KEGG Orthology, or KO, used to annotate and identify orthologous genes and related proteins11.

The modern KEGG comprises fifteen databases in all, grouped into four categories: systems information, which focuses on higher-level functioning of molecular data, including pathways; genomic information, which includes data on genes and their associated annotations; chemical information, which includes data on proteins and important chemical compounds; and health information, which includes data on diseases and drugs1. Gene and peptide annotation data are stored in KO, which now has its own algorithms, BlastKOALA and GhostKOALA, to annotate genomes1,12. BlastKOALA, which uses BLASTP, is the higher-precision algorithm and is used on single genomic sequences, whereas the faster but less precise GhostKOALA, which uses GHOSTX, is generally selected for analysis of metagenomes12. The GENES database also now includes smaller fragments of DNA, such as plasmids, as well as whole genomes1. The PATHWAY database, meanwhile, has expanded its reach from metabolic and regulatory pathways to pathways analyzing drug resistance. KEGG Mapper is the software used to analyze higher-level processes, including pathways, and can also link them to specific genes or proteins1. As a result, the databases are not separate but dynamically integrated.
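
The link between gene annotations and pathways described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not how KEGG Mapper or the KOALA tools are implemented: the gene names, the KO labels (`K_a`, `K_b`), the pathway labels, and the function `genes_by_pathway` are hypothetical placeholders. The sketch assumes each gene has already been assigned at most one KO and groups the annotated genes by the pathways their KOs belong to.

```python
from collections import defaultdict

# Hypothetical gene -> KO assignments, as an annotation step might produce.
# None marks a gene that received no KO assignment.
GENE_TO_KO = {
    "gene1": "K_a",
    "gene2": "K_b",
    "gene3": None,
    "gene4": "K_a",
}

# Hypothetical KO -> pathway memberships (a KO may belong to several pathways).
KO_TO_PATHWAYS = {
    "K_a": ["path_X"],
    "K_b": ["path_X", "path_Y"],
}


def genes_by_pathway(gene_to_ko, ko_to_pathways):
    """Group annotated genes under every pathway that contains their KO."""
    grouped = defaultdict(set)
    for gene, ko in gene_to_ko.items():
        if ko is None:
            continue  # unannotated genes cannot be placed on any pathway
        for pathway in ko_to_pathways.get(ko, []):
            grouped[pathway].add(gene)
    # Sort pathways and genes so the result is deterministic.
    return {pathway: sorted(genes) for pathway, genes in sorted(grouped.items())}


if __name__ == "__main__":
    for pathway, genes in genes_by_pathway(GENE_TO_KO, KO_TO_PATHWAYS).items():
        print(pathway, genes)
```

Because the KO label is the shared key between the two tables, the same lookup works for any genome once its genes have been annotated, which is the sense in which the databases are integrated rather than separate.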

One of the more promising new applications of KEGG is the study of health problems, including cancer, and their treatments. The DISEASE and DRUG databases contain detailed information about how a disease or drug affects a specific gene or pathway1,13. DISEASE entries, like all KEGG data, are organized into hierarchies, with multiple variants of a disease grouped under the same umbrella. Entries in the DRUG database, which catalogs the chemical formulas of drugs as well as their uses and contraindications, are typically based on Japanese drug labels, though FDA labels are sometimes used1,13. In the field of cancer biology, KEGG pathways are now routinely used to analyze genetic and epigenetic perturbations associated with carcinogenesis14-17. For example, one recent study used data from KEGG to discern potential health problems resulting from exposure to the Deepwater Horizon oil spill17.

All in all, KEGG has come a long way since the project began in 1995. The increasing availability of genomic information, combined with more sophisticated software tools, has led to numerous advancements in the past two decades. In a world where DNA sequencing technologies are becoming easier and cheaper, KEGG is an invaluable resource for scientists studying microbiology, whether to cure disease, understand how organisms function, or assemble the population of a microbiome. New ways of using KEGG will most likely emerge over time.

References

1. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res., 45, D353-D361 (2017)

2. Kanehisa, M. Linking databases and organisms: GenomeNet resources in Japan. Trends Biochem. Sci., 22, 442-444 (1997)

3. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27-30 (2000)

4. Collins, F.S., Morgan, M. & Patrinos, A. The Human Genome Project: Lessons from Large-Scale Biology. Science, 300, 286-290 (2003)

5. Goto, S., Bono, H., Ogata, H., Fujibuchi, W., Nishioka, T., Sato, K. & Kanehisa, M. Organizing and computing metabolic pathway data in terms of binary relations. Pac. Symp. Biocomput., 175-186 (1997)

6. Fujibuchi, W., Goto, S., Migimatsu, H., Uchiyama, I., Ogiwara, A., Akiyama, Y. & Kanehisa, M. DBGET/LinkDB: an integrated database retrieval system. Pac. Symp. Biocomput., 683-694 (1998)

7. Kanehisa, M. A database for post-genome analysis. Trends Genet., 13, 375-376 (1997)

8. Ogata, H., Goto, S., Fujibuchi, W. & Kanehisa, M. Computation with the KEGG pathway database. BioSystems, 47, 119-128 (1998)

9. Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H. & Kanehisa, M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 27, 29-34 (1999)

10. Goto, S., Nishioka, T. & Kanehisa, M. LIGAND: chemical database of enzyme reactions. Nucleic Acids Res., 28, 380-382 (2000)

11. Kanehisa, M. et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34, D354-D357 (2006)

12. Kanehisa, M., Sato, Y. & Morishima, K. BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences. J. Mol. Biol., 428, 726-731 (2016)

13. Kanehisa, M. KEGG: Kyoto Encyclopedia of Genes and Genomes (2017) www.kegg.jp [Accessed November 8, 2017]

14. Kaushal, A., Zhang, H., Karmaus, W.J.J., Ray, M., Torres, M.A., Smith, A.K. & Wang, S.-L. Comparison of different cell type correction methods for genome-scale epigenetics studies. BMC Bioinformatics, 18, 216 (2017)

15. Shi, K.-Q. et al. Hepatocellular carcinoma associated microRNA expression signature: integrated bioinformatics analysis, experimental validation and clinical significance. Oncotarget, 6, 25093-25108 (2015)

16. Klein, H.-U., Schäfer, M., Porse, B.T., Hasemann, M.S., Ickstadt, K. & Dugas, M. Integrative analysis of histone ChIP-seq and transcription data using Bayesian mixture models. Bioinformatics, 30, 1154-1162 (2014)

17. Liu, Y.-Z., Zhang, L., Roy-Engel, A.M., Saito, S., Lasky, J.A., Wang, G. & Wang, H. Carcinogenic effects of oil dispersants: A KEGG pathway-based RNA-seq study of human airway epithelial cells. Gene, 602, 16-23 (2017)