Kyubum "Kyu" Lee, PhD
Biomedical AI/ML Scientist
Principal Data Scientist at Amgen
Thousand Oaks, California, USA
E-mail: KYUBUMLEE [at] gmail [dot] com
Summary
Expertise in artificial intelligence and data mining, specifically in machine learning, natural language processing (NLP), and biomedical informatics
Extensive experience and in-depth knowledge of deep learning, statistics, data/text analysis, visualization, data integration, and biomedical data
Spearhead multiple projects collaborating with researchers in the computer science and biomedical fields
- First-author of 9 publications and co-author of 15 publications in peer-reviewed SCI journals
- Built 10 web-based services, each with more than 1K users, including LitVar, ChimerDB 3.0, HiPub, BEST, TIMEx, and BRONCO
- Fluent in Python, with extensive experience in Scikit-learn, TensorFlow, Keras, NLTK, and various data science and BioNLP tools
Experience / Education
December 2021 to Present: Principal Data Scientist at Center for Design and Analysis, Amgen, Thousand Oaks, CA, USA
• Next-generation clinical trial design and analysis
- Improving the efficiency of clinical trial and drug development process using ML/NLP
- Clinical trial text analysis using LLMs and BERT-based methods
- Developing a chatbot for clinical trial data and literature
March 2020 to November 2021: Applied Postdoctoral Researcher at Tan Lab, Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, USA. Mentor: Dr. Aik-Choon Tan
• Developed deep learning methods for oral and lung cancer histopathology image analysis
- Deep learning-based (Mask R-CNN) image analysis method for understanding tumor heterogeneity and tumor microenvironment
• Comprehensive ORAL cancer Explorer (CORALE) project: developed a multi-omics clinicopathological data web portal
- Processed 62 multi-omics datasets: alignment of 9 RNA-seq datasets, normalization of 53 microarray datasets, and clinical feature normalization
- Developed bioinformatics analysis and visualization tools
• Participated in oral and lung cancer-related biomedical informatics analysis projects
August 2017 to March 2020: Postdoctoral Researcher at BioNLP Lab, NCBI/NLM/NIH, USA. Mentor: Dr. Zhiyong Lu
• Orchestrated a research project which involved using deep learning for biomedical literature triage – Collaboration with the European Bioinformatics Institute (UK) and the Swiss Institute of Bioinformatics (Switzerland): reduced the manual workload by nearly 70% using machine learning + NLP
• Developed a web-based machine learning platform that provides tools for classifying biomedical literature
• Conducted analyses of publications on precision health to identify human genes of translational value – Collaboration with the Centers for Disease Control and Prevention (CDC)
• Participated in designing the search engine LitVar which retrieves genomic variants in biomedical documents in PubMed and PMC
February to May 2017: Research Professor at Data Mining & Information Systems Lab, Korea University, Seoul, Korea. Mentor: Prof. Jaewoo Kang
• Orchestrated a research project on building a cancer mutation knowledge-base (VarDrugPub) which involved using deep learning for extracting gene-drug-disease-mutation relations from biomedical literature
• Mentored graduate and undergraduate students
August 2014 to February 2015: Exchange Researcher at Translational Bioinformatics and Cancer Systems Biology Lab, University of Colorado - Anschutz Medical Campus, USA. Mentor: Prof. Aik-Choon Tan
• Built Biomedical Entity Relation Corpus (BRONCO) which contains gene-drug-disease-mutation relations found in biomedical texts
• Extracted gene-drug relations from biomedical texts using NLP tools for creating the Drug Signatures Database (DSigDB)
• Collaborated cross-functionally with researchers in translational bioinformatics and cancer biology, providing advice on biomedical projects
September 2012 to February 2017: Ph.D. Student at Data Mining & Information Systems Lab, Korea University, Seoul, Korea. Mentor: Prof. Jaewoo Kang
• PhD Thesis: Text mining approaches for knowledge extraction from biomedical literature
• VarDrugPub: Extracting Gene-Variant-Drug information from biomedical literature – Web service / Database
• HiPub: An application that translates biomedical texts to networks – Chrome extension
• ChimerDB 3.0 (ChimerPub): A database for fusion genes from biomedical literature – Web service / Database
September 2010 to August 2012: M.S. Student at Data Mining & Information Systems Lab, Korea University, Seoul, Korea. Mentor: Prof. Jaewoo Kang
• M.S. Thesis: Drug-drug interaction analysis using heterogeneous biological information network
Skills / Experience
Machine Learning
Deep learning: Convolutional Neural Networks, Transformers, BERT (BioBERT)
SVM, Random Forests, Decision Tree, Logistic Regression, Ridge/Lasso Regression (Classifiers), KNN, Linear regression
Hierarchical clustering, K-means clustering
Ensemble methods (Majority Voting, Bagging, Stacking)
Statistical approaches
Text mining / NLP
Named-Entity Recognition (NER) / Named-Entity Normalization (NEN)
Information extraction (Relation extraction)
Converting extracted information into databases / network web services
Literature triage / Text classification using machine learning (deep learning)
Semi-automated curation (Manual curators + NLP)
Word/Text embedding (Word2vec, Sent2vec, BERT, etc.)
Literature search / Information retrieval
Topic modeling (Latent Dirichlet allocation (LDA))
Regular Expressions
Data handling
Data collection and integration (using various APIs and tools)
Biomedical data resources
Accessing literature databases: PubMed, PMC, bioRxiv, EuropePMC
Other resources: NCBI (API), MeSH, ICD10, UMLS/MetaMap, PubChem, ChEBI, Disease Ontology, PubTator, KEGG, STRING db, TCGA, GEO, ArrayExpress, cBioPortal
Data cleaning
Data visualization
SQL, NoSQL (MongoDB, Neo4j)
Biomedical (Multi-omics) data analysis
Transcriptomics data analysis
Survival analysis
Microarray data normalization / RNA-seq alignment
Clinical data (term) normalization
Tumor-immune micro-environment analysis using bulk transcriptomics data
Image analysis
Cell segmentation/subtype classification using deep learning (Mask R-CNN)
Tumor-immune micro-environment analysis using histopathology image
Image processing
Others
Python, TensorFlow, Keras, NumPy, Pandas, Scikit-learn, Matplotlib, Seaborn, Jupyter Notebook, Git, AWS, etc.
Links for Recent Projects
LitSuggest: A Web-based System for Literature Recommendation and Curation using Machine Learning (Co-First Author / Developed ML core and Data processing part.) [Link] [Publication in Nucleic Acids Research]
CORALE: Comprehensive ORAL cancer Explorer (First Author / Data processing and analysis tool development) [Link] [Publication in progress]
Literature Triage using Machine Learning (First Author) [GitHub] [Publication in PLOS Computational Biology]
VarDrugPub: Variant-Gene-Drug relations Database (First Author) [Link] [Publication in BMC Bioinformatics]
TIMEx: Tumor-immune microenvironment deconvolution web-portal for bulk transcriptomics using pan-cancer scRNA-seq signatures (Participated in data processing) [Link] [Publication in Bioinformatics]
LitVar: Search Engine for Genomic Variants in PubMed and PMC [Link] [Publication in Nucleic Acids Research]
HiPub: An application that translates biomedical texts to networks (First Author) [Link] [Publication in Bioinformatics] [AltMetric]
BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations (First Author) [Link] [Publication in Database]
ChimerDB 3.0 (Co-first author, Built a fusion gene DB using Text-mining - ChimerPub) [Link] [Wikipedia] [Publication in Nucleic Acids Research] [PDF]
BEST: Biomedical Entity Search Tool (Actively Involved) [Link] [Publication in PLOS ONE]
BEReX: Biomedical Entity-Relationship eXplorer (Actively Involved) [Link] [Publication in Bioinformatics]
DSigDB: Drug SIGnatures DataBase (Involved in building text-mined drug-gene sets) [Link] [Publication in Bioinformatics]
Awards:
Recognition for Honorary Award (Recognition and Appreciation of Special Achievement): National Library of Medicine, National Institutes of Health, U.S. Department of Health and Human Services; Dec. 2018
Best Paper of the Year Award: Korea University; Feb. 2017
Publications:
Journals:
John H. Lockhart, Hayley D. Ackerman, Kyubum Lee, Mahmoud Abdalah, Andrew John Davis, Nicole Hackel, Theresa A. Boyle, James Saller, Aysenur Keske, Kay Hänggi, Brian Ruffell, Olya Stringfield, W. Douglas Cress, Aik Choon Tan & Elsa R. Flores: Grading of lung adenocarcinomas with simultaneous segmentation by artificial intelligence (GLASS-AI). NPJ Precision Oncology, 2024 [Full Text]
Travis C. Hyams, Ling Luo, Brionna Hair, Kyubum Lee, Zhiyong Lu, and Daniela Seminara: Machine Learning Approach to Facilitate Knowledge Synthesis at the Intersection of Liver Cancer, Epidemiology, and Health Disparities Research. JCO Clinical Cancer Informatics, 2022 [Full Text]
Marco Napoli, Sarah J Wu, Bethanie L Gore, Hussein Abbas, Kyubum Lee, Rahul Checker, Shilpa Dhar, Kimal Rajapakshe, Aik Choon Tan, Min Gyu Lee*, Cristian Coarfa*, Elsa R Flores*: ΔNp63 regulates a common landscape of enhancer associated genes in non-small cell lung cancer. Nature Communications, 2022 [Full Text]
Kyubum Lee†, John H. Lockhart†, Mengyu Xie, Ritu Chaudhary, Robbert J. Slebos, Elsa R. Flores, Christine H. Chung and Aik Choon Tan*: Deep Learning of Histopathology Images at the Single Cell Level. Frontiers in Artificial Intelligence, 2021 († These authors contributed equally) [Abstract] [Full Text]
Alexis Allot†, Kyubum Lee†, Qingyu Chen†, Ling Luo, and Zhiyong Lu*: LitSuggest: A Web-based System for Literature Recommendation and Curation using Machine Learning. Nucleic Acids Research, 2021 († These authors contributed equally) [Full Text] [Database Link]
Mengyu Xie, Kyubum Lee, John H. Lockhart, Scott D. Cukras, Rodrigo Carvajal, Amer A. Beg, Elsa R. Flores, Mingxiang Teng, Christine H. Chung, and Aik Choon Tan*: TIMEx: tumor-immune microenvironment de-convolution web-portal for bulk transcriptomics using pan-cancer scRNA-seq signatures. Bioinformatics, 2021 [Full Text] [Link]
Kyubum Lee, Chih-Hsuan Wei, and Zhiyong Lu*: Recent advances of automated methods for searching and extracting genomic variant information from biomedical literature. Briefings in Bioinformatics, 2020 [Full Text]
Qingyu Chen, Kyubum Lee, Shankai Yan, Sun Kim, and Zhiyong Lu*: BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale. PLOS Computational Biology, 2020. [Full Text]
Kyubum Lee, Mindy Clyne, Wei Yu, Zhiyong Lu*, and Muin Khoury*: Tracking human genes along the translational continuum. npj Genomic Medicine, 2019. [Full Text]
Kyubum Lee, Maria Livia Famiglietti, Aoife McMahon, Chih-Hsuan Wei, Jacqueline Ann Langdon MacArthur, Sylvain Poux, Lionel Breuza, Alan Bridge, Fiona Cunningham, Ioannis Xenarios and Zhiyong Lu*: Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. PLOS Computational Biology, 2018. [Full Text] [Hot paper of the week at NIH Intramural Research News Letter]
Alexis Allot†, Yifan Peng†, Chih-Hsuan Wei, Kyubum Lee, Lon Phan, and Zhiyong Lu*: LitVar: a semantic search engine for linking genomic variant data in PubMed and PMC. Nucleic Acids Research 2018. [Full Text] [Database Link]
Kyubum Lee†, Byounggun Kim†, Yonghwa Choi, Sunkyu Kim, Wonho Shin, Sunwon Lee, Sungjoon Park, Seongsoon Kim, Aik Choon Tan* and Jaewoo Kang*: Deep learning of mutation-gene-drug relations from the literature. BMC Bioinformatics, 2018. DOI: 10.1186/s12859-018-2029-1 [Full Text] [Database Link]
Sangrak Lim, Kyubum Lee, Jaewoo Kang: Drug-drug interaction extraction from the literature using a recursive neural network. PLoS ONE, 2018. DOI: 10.1371/journal.pone.0190926 [Full Text]
Seongsoon Kim†, Donghyeon Park†, Yonghwa Choi†, Kyubum Lee, Byounggun Kim, Minji Jeon, Jihye Kim, Aik Choon Tan, Jaewoo Kang*: A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis. JMIR Medical Informatics, 2017. DOI: 10.2196/medinform.8751 [Full Text]
Myunggyo Lee†, Kyubum Lee†, Namhee Yu†, Insu Jang†, Ikjung Choi, Pora Kim, Ye Eun Jang, Byounggun Kim, Sunkyu Kim, Byungwook Lee, Jaewoo Kang*, and Sanghyuk Lee*: ChimerDB 3.0: an enhanced database for fusion genes from cancer transcriptome and literature data mining. Nucleic Acids Research 2017. DOI:10.1093/nar/gkw1083 († These authors contributed equally) [Full Text] [Database Link]
Kyubum Lee, Wonho Shin, Byounggun Kim, Sunwon Lee, Yonghwa Choi, Sunkyu Kim, Minji Jeon, Aik Choon Tan* and Jaewoo Kang*: HiPub: Translating PubMed and PMC Texts to Networks for Knowledge Discovery. Bioinformatics 08/2016; 32(18). DOI:10.1093/bioinformatics/btw511 [Full Text] [Link]
Kyubum Lee, Sunwon Lee, Sungjoon Park, Sunkyu Kim, Suhkyung Kim, Kwanghun Choi, Aik Choon Tan* and Jaewoo Kang*: BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations. Database The Journal of Biological Databases and Curation 04/2016; 2016. DOI:10.1093/database/baw043 [Full Text]
Jocelyn Barbosa, Kyubum Lee, Sunwon Lee, Bilal Lodhi, Jae-Gu Cho, Woo-Keun Seo, Jaewoo Kang*: Efficient quantitative assessment of facial paralysis using iris segmentation and active contour-based key points detection with hybrid classifier. BMC Medical Imaging 12/2016; 16(1). DOI:10.1186/s12880-016-0117-0 [Full Text]
Sunwon Lee†, Donghyeon Kim†, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Sunkyu Kim, Aik-Choon Tan, Jaewoo Kang*: BEST: Next-Generation Biomedical Entity Search Tool for Knowledge Discovery from Biomedical Literature. PLoS ONE 10/2016; 11(10). DOI:10.1371/journal.pone.0164680 († These authors contributed equally to the work.) [Full Text] [Link]
Minjae Yoo, Jimin Shin, Jihye Kim, Karen A Ryall, Kyubum Lee, Sunwon Lee, Minji Jeon, Jaewoo Kang, Aik Choon Tan*: DSigDB: Drug Signatures Database for Gene Set Analysis. Bioinformatics 05/2015; 31(18). DOI:10.1093/bioinformatics/btv313 [Link] [Full Text]
Woo Keun Seo, Jaewoo Kang, Minji Jeon, Kyubum Lee, Sunwon Lee, Ji Hyun Kim, Kyungmi Oh, Seong Beom Koh: Feasibility of Using a Mobile Application for the Monitoring and Management of Stroke-Associated Risk Factors. Journal of Clinical Neurology 04/2015; 11(2). DOI:10.3988/jcn.2015.11.2.142 [Link]
Minji Jeon, Sunwon Lee, Kyubum Lee, Aik-Choon Tan, Jaewoo Kang*: BEReX: Biomedical Entity-Relationship eXplorer. Bioinformatics 01/2014; 30(1). DOI:10.1093/bioinformatics/btt598 [Link] [Full Text]
Junkyu Lee, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang*: On the efficacy of per-relation basis performance evaluation for PPI extraction and a high-precision rule-based approach. BMC Medical Informatics and Decision Making 04/2013; 13(1). DOI:10.1186/1472-6947-13-S1-S7 [Link]
Jaehoon Choi, Donghyeon Kim, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang*: BOSS: Context-enhanced search for biomedical objects. BMC Medical Informatics and Decision Making 04/2012; 12 Suppl 1(Suppl 1). DOI:10.1186/1472-6947-12-S1-S7 [Link]
Hanjun Shin, Ki Hoon Kim, Chihwan Song, Injoon Lee, Kyubum Lee, Jaewoo Kang, Yoon Kyoo Kang*: Electrodiagnosis support system for localizing neural injury in an upper limb. Journal of the American Medical Informatics Association 05/2010; 17(3). DOI:10.1136/jamia.2009.001594 [Link]
Sunwon Lee, Kyubum Lee, Jaewoo Kang*, Jaehoon Choi, Junho Oh: Trends in Personalized Medicine Research. Communications of the Korean Institute of Information Scientists and Engineers, Vol.29, Issue 4, Pages:19-25, Apr 2011 [Written in Korean]
Conferences / Meetings:
Proceedings:
Chih-Hsuan Wei, Kyubum Lee, Robert Leaman, Zhiyong Lu: Biomedical Mention Disambiguation Using a Deep Learning Approach. The 10th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2019), Niagara Falls, NY; Sept 2019 [Link]
Donghyeon Kim†, Sunwon Lee†, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Aik-Choon Tan, Jaewoo Kang*: BEST: Next-Generation Biomedical Entity Search Tool for Knowledge Discovery from Biomedical Literature. The 5th Annual Translational Bioinformatics Conference (TBC 2015), Tokyo, Japan; Sept. 2015 († These authors contributed equally to the work.)
Kyubum Lee, Sunwon Lee, Minji Jeon, Jaehoon Choi, Jaewoo Kang*: Drug-drug interaction analysis using heterogeneous biological information network. IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2012), Philadelphia, USA; Oct. 2012 [Link]
Junkyu Lee, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang*: High Precision Rule Based PPI Extraction and Per-Pair Basis Performance Evaluation. ACM sixth international workshop on Data and text mining in biomedical informatics (DTMBIO 2012), Maui, Hawaii, USA; Oct. 2012
Kyubum Lee, Sunwon Lee, Jaewoo Kang*: SNP Grouping Method Based on PPI Network Information. The 37th Conference of the Korea Information Processing Society, Apr 2012 [Written in Korean]
Taewon Joh, Kyubum Lee, Jaewoo Kang*: Comparative analysis of Biomedical Databases and Text mining Technologies. The 34th Conference of the Korea Information Processing Society, Nov 2010 [Written in Korean]
Hojun Kim, Seongyeon Won, Seungwoo Gang, Kyubum Lee, Byounggun Kim, Sunkyu Kim, Jaewoo Kang*: Research on Identifying Mutation-Drug Relationship in Biomedical Literature Using Biomedical Context based pre-trained word embedding. KIPS 2017 Spring, Jeju, Korea; April 2017 [Written in Korean]
Posters:
Kyubum Lee, Mengyu Xie, Scott D. Cukras, John H. Lockhart, Rodrigo Carvajal, Elsa R. Flores, Christine H. Chung, and Aik-Choon Tan: Comprehensive Oral Cancer Explorer (CORALE): A user-friendly web-based oral cancer data analysis portal. The 11th Annual Moffitt Scientific Symposium; Apr. 28, 2021
John H Lockhart, Hayley D Ackerman, Kyubum Lee, Mahmoud Abdalah, Andrew Davis, Nicole Montey, Theresa Boyle, James Saller, Aysenur Keske, Kay Hänggi, Brian Ruffell, Olya Stringfield, Aik Choon Tan, Elsa R Flores: Automated tumor segmentation, grading, and analysis of tumor heterogeneity in preclinical models of lung adenocarcinoma. AACR Virtual Special Conference on Artificial Intelligence, Diagnosis, and Imaging; Jan. 13-14, 2021 [Abstract]
John H. Lockhart, Kyubum Lee, Hayley D. Ackerman, Mahmoud Abdalah, Andrew Davis, Nicole Montey, Theresa Boyle, James Saller, Aysenur Keske, Kay Hänggi, Olya Stringfield, Aik Choon Tan and Elsa R. Flores: Spatial genomics coupled with machine learning to identify p53-driven molecular signatures that are predictive of lung adenocarcinoma progression. AACR Virtual Special Conference on Tumor Heterogeneity: From Single Cells to Clinical Impact; September 17-18, 2020 [Abstract]
Kyubum Lee, Chih-Hsuan Wei, Livia Famiglietti, Sylvain Poux, Lionel Breuza, Alan Bridge, Ioannis Xenarios and Zhiyong Lu*: Scaling up data curation using deep learning: An application to literature triage in genomic variation resources. ISMB 2018, Chicago, USA; July 2018
Kyubum Lee, Byounggun Kim, Yonghwa Choi, Sunkyu Kim, Wonho Shin, Sunwon Lee, Sungjoon Park, Seongsoon Kim, Aik Choon Tan and Jaewoo Kang*: Deep learning of mutation-gene-drug relations from the literature for precision medicine. ISMB/ECCB 2017, Prague, Czech Republic; July 2017 (doi: 10.7490/f1000research.1114641.1) [Poster] [Link]
Talks:
Applying Machine Learning to Biomedical Literature Mining: ‘Natural Language Processing & Health’ class, Population Health Sciences, Weill Cornell Medicine; Mar. 2021
Machine Learning for Literature Mining: ‘Introduction to Text Mining’ class, FAES, NIH; Mar. 2020
Scaling up data curation using machine learning: An application to literature triage in genomic variation resources. CBB Seminar, NCBI, NLM, NIH; Feb. 2019
Biomedical Literature Search, Mining and Applications: College of Medicine, Seoul National University; Nov. 2018 [Online talk]
Machine-assisted Variant Curation. Biomedical Linked Annotation Hackathon 4 (BLAH4), Kashiwa, Japan; Jan. 2018 [Link]
Education:
Korea University / Data Mining & Information Systems Lab (Advisor: Prof. Jaewoo Kang) - Seoul, Korea
Ph.D. in Computer Science and Engineering (Data Mining and Machine Learning): September 2012 to February 2017
Ph.D. Thesis: Text mining approaches for knowledge extraction from biomedical literature
M.S. in Computer Science and Bioinformatics (Bioinformatics): September 2010 to August 2012
B.S. in Computer Science: September 2008 to August 2010
B.S. in Life Science: March 2002 to August 2008 (On leave: 2003–2005, Military Service)