Kyubum "Kyu" Lee, PhD

Biomedical AI/ML Scientist

Principal Data Scientist at Amgen

Thousand Oaks, California, USA

E-mail: KYUBUMLEE [at] gmail [dot] com

Summary

  • Expertise in artificial intelligence and data mining, specifically in machine learning, natural language processing (NLP), and biomedical informatics

  • Extensive experience and in-depth knowledge of deep learning, statistics, data/text analysis, visualization, data integration, and biomedical data

  • Spearheaded multiple projects in collaboration with researchers in the computer science and biomedical fields

    • First author of 8 publications and co-author of 13 publications in peer-reviewed SCI journals

    • Built 10 web-based services, each with more than 1K users, including LitVar, ChimerDB 3.0, HiPub, BEST, TIMEx, and BRONCO

  • Fluent in Python, with extensive experience in Scikit-learn, TensorFlow, Keras, NLTK, and various data science and BioNLP tools


Experience / Education

• Develop deep learning methods for oral and lung cancer histopathology image analysis

- Deep learning-based (Mask R-CNN) image analysis method for understanding tumor heterogeneity and tumor microenvironment

• Comprehensive ORAL cancer Explorer (CORALE) project: develop a multi-omics clinicopathological data web portal

- Processed 62 multi-omics datasets: alignment of 9 RNA-seq datasets, normalization of 53 microarray datasets, and clinical feature normalization

- Developed bioinformatics analysis and visualization tools

• Participate in oral and lung cancer-related biomedical informatics analysis projects

• Orchestrated a research project which involved using deep learning for biomedical literature triage – Collaboration with the European Bioinformatics Institute (UK) and the Swiss Institute of Bioinformatics (Switzerland): reduced the manual workload by nearly 70% using machine learning + NLP

• Develop a web-based machine learning platform that provides tools for classifying biomedical literature

• Conduct analyses of publications on precision health to identify human genes of translational value – Collaboration with the Centers for Disease Control and Prevention (CDC)

• Participated in designing the search engine LitVar, which retrieves genomic variants from biomedical documents in PubMed and PMC

• Orchestrated a research project on building a cancer mutation knowledge-base (VarDrugPub) which involved using deep learning for extracting gene-drug-disease-mutation relations from biomedical literature

• Mentoring graduate/undergraduate students

• Built Biomedical Entity Relation Corpus (BRONCO) which contains gene-drug-disease-mutation relations found in biomedical texts

• Extracted gene-drug relations from biomedical texts using NLP tools for creating the Drug Signatures Database (DSigDB)

• Collaborated cross-functionally with researchers in translational bioinformatics and cancer biology, providing advice on biomedical projects

• PhD Thesis: Text mining approaches for knowledge extraction from biomedical literature

• VarDrugPub: Extracting Gene-Variant-Drug information from biomedical literature – Web service / Database

• HiPub: An application that translates biomedical texts to networks – Chrome extension

• ChimerDB 3.0 (ChimerPub): A database for fusion genes from biomedical literature – Web service / Database

• M.S. Thesis: Drug-drug interaction analysis using heterogeneous biological information network


Skills / Experience

  • Machine Learning

    • Deep learning: Convolutional Neural Networks, Transformers, BERT (BioBERT)

    • SVM, Random Forests, Decision Tree, Logistic Regression, Ridge/Lasso Regression (Classifiers), KNN, Linear regression

    • Hierarchical clustering, K-means clustering

    • Ensemble methods (Majority Voting, Bagging, Stacking)

    • Statistical approaches

  • Text mining / NLP

    • Named-Entity Recognition (NER) / Named-Entity Normalization (NEN)

    • Information extraction (Relation extraction)

    • Extracted information to database / network → Web service

    • Literature triage / Text classification using machine learning (deep learning)

    • Semi-automated curation (Manual curators + NLP)

    • Word/Text embedding (Word2vec, Sent2vec, BERT, etc.)

    • Literature search / Information retrieval

    • Topic modeling (Latent Dirichlet allocation (LDA))

    • Regular Expressions

  • Data handling

    • Data collection and integration (using various APIs and tools)

      • Biomedical data resources

        • Accessing literature databases: PubMed, PMC, bioRxiv, EuropePMC

        • Other resources: NCBI (API), MeSH, ICD10, UMLS/MetaMap, PubChem, ChEBI, Disease Ontology, PubTator, KEGG, STRING db, TCGA, GEO, ArrayExpress, cBioPortal

    • Data cleaning

    • Data visualization

    • SQL, NoSQL (MongoDB, Neo4j)

  • Biomedical (Multi-omics) data analysis

    • Transcriptomics data analysis

      • Survival analysis

      • Microarray data normalization / RNAseq alignment

      • Clinical data (term) normalization

      • Tumor-immune microenvironment analysis using bulk transcriptomics data

  • Image analysis

    • Cell segmentation / subtype classification using deep learning (Mask R-CNN)

    • Tumor-immune microenvironment analysis using histopathology images

    • Image processing

  • Others

    • Python, TensorFlow, Keras, NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn, Jupyter Notebook, Git, AWS, etc.


Links for Recent Projects

LitSuggest: A Web-based System for Literature Recommendation and Curation using Machine Learning (Co-First Author / Developed ML core and Data processing part.) [Link] [Publication in Nucleic Acids Research]

CORALE: Comprehensive ORAL cancer Explorer (First Author / Data processing and analysis tool development) [Link] [Publication in progress]

Literature Triage using Machine Learning (First Author) [GitHub] [Publication in PLOS Computational Biology]

VarDrugPub: Variant-Gene-Drug relations Database (First Author) [Link] [Publication in BMC Bioinformatics]

TIMEx: Tumor-immune microenvironment deconvolution web-portal for bulk transcriptomics using pan-cancer scRNA-seq signatures (Participated in data processing) [Link] [Publication in Bioinformatics]

LitVar: Search Engine for Genomic Variants in PubMed and PMC [Link] [Publication in Nucleic Acids Research]

HiPub: An application that translates biomedical texts to networks (First Author) [Link] [Publication in Bioinformatics] [AltMetric]

BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations (First Author) [Link] [Publication in Database]

ChimerDB 3.0 (Co-first author, Built a fusion gene DB using Text-mining - ChimerPub) [Link] [Wikipedia] [Publication in Nucleic Acids Research] [PDF]

BEST: Biomedical Entity Search Tool (Actively Involved) [Link] [Publication in PLOS ONE]

BEReX: Biomedical Entity-Relationship eXplorer (Actively Involved) [Link] [Publication in Bioinformatics]

DSigDB: Drug SIGnatures DataBase (Involved in building text-mined drug-gene sets) [Link] [Publication in Bioinformatics]


Other Links:

Google Scholar

ResearchGate

PubMed

LinkedIn


Awards:

  • Recognition for Honorary Award (Recognition and Appreciation of Special Achievement): National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services; Dec. 2018

  • Best Paper of the Year Award: Korea University; Feb. 2017


Publications:

Journals:

    • Kyubum Lee†, John H. Lockhart†, Mengyu Xie, Ritu Chaudhary, Robbert J. Slebos, Elsa R. Flores, Christine H. Chung and Aik Choon Tan*: Deep Learning of Histopathology Images at the Single Cell Level. Frontiers in Artificial Intelligence, 2021 († These authors contributed equally) [Abstract] [Full Text]

    • Alexis Allot†, Kyubum Lee†, Qingyu Chen†, Ling Luo, and Zhiyong Lu*: LitSuggest: A Web-based System for Literature Recommendation and Curation using Machine Learning. Nucleic Acids Research, 2021 († These authors contributed equally) [Full Text] [Database Link]

    • Mengyu Xie, Kyubum Lee, John H. Lockhart, Scott D. Cukras, Rodrigo Carvajal, Amer A. Beg, Elsa R. Flores, Mingxiang Teng, Christine H. Chung, and Aik Choon Tan*: TIMEx: tumor-immune microenvironment de-convolution web-portal for bulk transcriptomics using pan-cancer scRNA-seq signatures. Bioinformatics, 2021 [Full Text] [Link]

    • Kyubum Lee, Chih-Hsuan Wei, and Zhiyong Lu*: Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Briefings in Bioinformatics, 2020 [Full Text]

    • Qingyu Chen, Kyubum Lee, Shankai Yan, Sun Kim, and Zhiyong Lu*: BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. PLOS Computational Biology, 2020. [Full Text]

    • Kyubum Lee, Mindy Clyne, Wei Yu, Zhiyong Lu*, and Muin Khoury*: Tracking human genes along the translational continuum. npj Genomic Medicine, 2019. [Full Text]

    • Kyubum Lee, Maria Livia Famiglietti, Aoife McMahon, Chih-Hsuan Wei, Jacqueline Ann Langdon MacArthur, Sylvain Poux, Lionel Breuza, Alan Bridge, Fiona Cunningham, Ioannis Xenarios and Zhiyong Lu*: Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLOS Computational Biology, 2018. [Full Text] [Hot paper of the week at NIH Intramural Research News Letter]

    • Alexis Allot†, Yifan Peng†, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, and Zhiyong Lu*: LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Research 2018. [Full Text] [Database Link]

    • Kyubum Lee†, Byounggun Kim†, Yonghwa Choi, Sunkyu Kim, Wonho Shin, Sunwon Lee, Sungjoon Park, Seongsoon Kim, Aik Choon Tan* and Jaewoo Kang*: Deep learning of mutation-gene-drug relations from the literature. BMC Bioinformatics, 2018. DOI: 10.1186/s12859-018-2029-1 [Full Text] [Database Link]

    • Sangrak Lim, Kyubum Lee, Jaewoo Kang: Drug-drug interaction extraction from the literature using a recursive neural network. PLoS ONE, 2018. DOI: 10.1371/journal.pone.0190926 [Full Text]

    • Seongsoon Kim†, Donghyeon Park†, Yonghwa Choi†, Kyubum Lee, Byounggun Kim, Minji Jeon, Jihye Kim, Aik Choon Tan, Jaewoo Kang*: A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis. JMIR Medical Informatics 2017. DOI: 10.2196/medinform.8751 [Full Text]

    • Myunggyo Lee†, Kyubum Lee†, Namhee Yu†, Insu Jang†, Ikjung Choi, Pora Kim, Ye Eun Jang, Byounggun Kim, Sunkyu Kim, Byungwook Lee, Jaewoo Kang*, and Sanghyuk Lee*: ChimerDB 3.0: an enhanced database for fusion genes from cancer transcriptome and literature data mining. Nucleic Acids Research 2017. DOI:10.1093/nar/gkw1083 († These authors contributed equally) [Full Text] [Database Link]

    • Kyubum Lee, Wonho Shin, Byounggun Kim, Sunwon Lee, Yonghwa Choi, Sunkyu Kim, Minji Jeon, Aik Choon Tan* and Jaewoo Kang*: HiPub: Translating PubMed and PMC Texts to Networks for Knowledge Discovery. Bioinformatics 08/2016; 32(18). DOI:10.1093/bioinformatics/btw511 [Full Text] [Link]

    • Kyubum Lee, Sunwon Lee, Sungjoon Park, Sunkyu Kim, Suhkyung Kim, Kwanghun Choi, Aik Choon Tan* and Jaewoo Kang*: BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations. Database The Journal of Biological Databases and Curation 04/2016; 2016. DOI:10.1093/database/baw043 [Full Text]

    • Jocelyn Barbosa, Kyubum Lee, Sunwon Lee, Bilal Lodhi, Jae-Gu Cho, Woo-Keun Seo, Jaewoo Kang*: Efficient quantitative assessment of facial paralysis using iris segmentation and active contour-based key points detection with hybrid classifier. BMC Medical Imaging 12/2016; 16(1). DOI:10.1186/s12880-016-0117-0 [Full Text]

    • Sunwon Lee†, Donghyeon Kim†, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Sunkyu Kim, Aik-Choon Tan, Jaewoo Kang*: BEST: Next-Generation Biomedical Entity Search Tool for Knowledge Discovery from Biomedical Literature. PLoS ONE 10/2016; 11(10). DOI:10.1371/journal.pone.0164680 († These authors contributed equally to the work.) [Full Text] [Link]

    • Minjae Yoo, Jimin Shin, Jihye Kim, Karen A Ryall, Kyubum Lee, Sunwon Lee, Minji Jeon, Jaewoo Kang, Aik Choon Tan*: DSigDB: Drug Signatures Database for Gene Set Analysis. Bioinformatics 05/2015; 31(18). DOI:10.1093/bioinformatics/btv313 [Link] [Full Text]

    • Woo Keun Seo, Jaewoo Kang, Minji Jeon, Kyubum Lee, Sunwon Lee, Ji Hyun Kim, Kyungmi Oh, Seong Beom Koh: Feasibility of Using a Mobile Application for the Monitoring and Management of Stroke-Associated Risk Factors. Journal of Clinical Neurology 04/2015; 11(2). DOI:10.3988/jcn.2015.11.2.142 [Link]

    • Minji Jeon, Sunwon Lee, Kyubum Lee, Aik-Choon Tan, Jaewoo Kang*: BEReX: Biomedical Entity-Relationship eXplorer. Bioinformatics 01/2014; 30(1). DOI:10.1093/bioinformatics/btt598 [Link] [Full Text]

    • Junkyu Lee, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang*: On the efficacy of per-relation basis performance evaluation for PPI extraction and a high-precision rule-based approach. BMC Medical Informatics and Decision Making 04/2013; 13(1). DOI:10.1186/1472-6947-13-S1-S7 [Link]

    • Jaehoon Choi, Donghyeon Kim, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang*: BOSS: Context-enhanced search for biomedical objects. BMC Medical Informatics and Decision Making 04/2012; 12 Suppl 1(Suppl 1). DOI:10.1186/1472-6947-12-S1-S7 [Link]

    • Hanjun Shin, Ki Hoon Kim, Chihwan Song, Injoon Lee, Kyubum Lee, Jaewoo Kang, Yoon Kyoo Kang*: Electrodiagnosis support system for localizing neural injury in an upper limb. Journal of the American Medical Informatics Association 05/2010; 17(3). DOI:10.1136/jamia.2009.001594 [Link]

    • Sunwon Lee, Kyubum Lee, Jaewoo Kang*, Jaehoon Choi, Junho Oh: Trends in Personalized Medicine Research. Communications of the Korean Institute of Information Scientists and Engineers, Vol.29, Issue 4, Pages:19-25, Apr 2011 [Written in Korean]

Conferences / Meetings:

Proceedings:

  • Chih-Hsuan Wei, Kyubum Lee, Robert Leaman, Zhiyong Lu: Biomedical Mention Disambiguation Using a Deep Learning Approach. The 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2019), Niagara Falls, NY; Sept 2019 [Link]

  • Donghyeon Kim†, Sunwon Lee†, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Aik-Choon Tan, Jaewoo Kang*: BEST: Next-Generation Biomedical Entity Search Tool for Knowledge Discovery from Biomedical Literature. The 5th Annual Translational Bioinformatics Conference (TBC 2015), Tokyo, Japan; Sept. 2015 († These authors contributed equally to the work.)

  • Kyubum Lee, Sunwon Lee, Minji Jeon, Jaehoon Choi, Jaewoo Kang*: Drug-drug interaction analysis using heterogeneous biological information network. IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2012), Philadelphia, USA; Oct. 2012 [Link]

  • Junkyu Lee, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang*: High Precision Rule Based PPI Extraction and Per-Pair Basis Performance Evaluation. ACM sixth international workshop on Data and text mining in biomedical informatics (DTMBIO 2012), Maui, Hawaii, USA; Oct. 2012

  • Kyubum Lee, Sunwon Lee, Jaewoo Kang*: SNP Grouping Method Based on PPI Network Information. The 37th Conference of the Korea Information Processing Society, Apr 2012 [Written in Korean]

  • Taewon Joh, Kyubum Lee, Jaewoo Kang*: Comparative analysis of Biomedical Databases and Text mining Technologies. The 34th Conference of the Korea Information Processing Society, Nov 2010 [Written in Korean]

  • Hojun Kim, Seongyeon Won, Seungwoo Gang, Kyubum Lee, Byounggun Kim, Sunkyu Kim, Jaewoo Kang*: Research on Identifying Mutation-Drug Relationship in Biomedical Literature Using Biomedical Context based pre-trained word embedding. KIPS 2017 Spring, Jeju, Korea; April 2017 [Written in Korean]

Posters:

  • Kyubum Lee, Mengyu Xie, Scott D. Cukras, John H. Lockhart, Rodrigo Carvajal, Elsa R. Flores, Christine H. Chung, and Aik-Choon Tan: Comprehensive Oral Cancer Explorer (CORALE): A user-friendly web-based oral cancer data analysis portal. The 11th Annual Moffitt Scientific Symposium; Apr. 28, 2021

  • John H Lockhart, Hayley D Ackerman, Kyubum Lee, Mahmoud Abdalah, Andrew Davis, Nicole Montey, Theresa Boyle, James Saller, Aysenur Keske, Kay Hänggi, Brian Ruffell, Olya Stringfield, Aik Choon Tan, Elsa R Flores: Automated tumor segmentation, grading, and analysis of tumor heterogeneity in preclinical models of lung adenocarcinoma. AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; Jan. 13-14, 2021 [Abstract]

  • John H. Lockhart, Kyubum Lee, Hayley D. Ackerman, Mahmoud Abdalah, Andrew Davis, Nicole Montey, Theresa Boyle, James Saller, Aysenur Keske, Kay Hänggi, Olya Stringfield, Aik Choon Tan and Elsa R. Flores: Spatial genomics coupled with machine learning to identify p53-driven molecular signatures that are predictive of lung adenocarcinoma progression. AACR Virtual Special Conference on Tumor Heterogeneity: From Single Cells to Clinical Impact; September 17-18, 2020 [Abstract]

  • Kyubum Lee, Chih-Hsuan Wei, Livia Famiglietti, Sylvain Poux, Lionel Breuza, Alan Bridge, Ioannis Xenarios and Zhiyong Lu*: Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. ISMB 2018, Chicago, USA; July 2018

  • Kyubum Lee, Byounggun Kim, Yonghwa Choi, Sunkyu Kim, Wonho Shin, Sunwon Lee, Sungjoon Park, Seongsoon Kim, Aik Choon Tan and Jaewoo Kang*: Deep learning of mutation-gene-drug relations from the literature for precision medicine. ISMB/ECCB 2017, Prague, Czech Republic; July 2017 (doi: 10.7490/f1000research.1114641.1) [Poster] [Link]

Talks:

  • Applying Machine Learning to Biomedical Literature Mining: ‘Natural Language Processing & Health’ class, Population Health Sciences, Weill Cornell Medicine; Mar. 2021

  • Machine Learning for Literature Mining: ‘Introduction to Text Mining’ class, FAES, NIH; Mar. 2020

  • Scaling up data curation using machine learning: An application to literature triage in genomic variation resources. CBB Seminar, NCBI, NLM, NIH; Feb. 2019

  • Biomedical Literature Search, Mining and Applications: College of Medicine, Seoul National University; Nov. 2018 [Online talk]

  • Machine-assisted Variant Curation. Biomedical Linked Annotation Hackathon 4 (BLAH4), Kashiwa, Japan; Jan. 2018 [Link]


Education:

Korea University / Data Mining & Information Systems Lab (Advisor: Prof. Jaewoo Kang) - Seoul, Korea

Ph.D. in Computer Science and Engineering (Data Mining and Machine Learning): September 2012 to February 2017

  • Ph.D. Thesis: Text mining approaches for knowledge extraction from biomedical literature

M.S. in Computer Science and Bioinformatics (Bioinformatics): September 2010 to August 2012

B.S. in Computer Science: September 2008 to August 2010

B.S. in Life Science: March 2002 to August 2008 (On leave: 2003–2005, Military Service)