L'OMAR 

Sarker's Lab of Omics Mining and Algorithmic Reasoning

 

AI for living equal, longer and healthier.

Projects

Developing a Knowledge Graph Driven Integrative Framework for Explainable Protein Function Prediction via Generative Deep Learning

Funding: PI; NSF #2302637; HBCU Excellence in Research; 2023-2026.

Members and Affiliates:   

Relevant Publications:

Brief Description. 

Proteins are the building blocks of life, performing a multitude of functions that include, but are not limited to, catalyzing reactions as enzymes, participating in the body's defense mechanisms as antibodies, forming structures, and transporting important chemicals. The interactions among proteins describe the molecular mechanisms of diseases and convey potentially important insights about disease prevention, diagnosis, and treatment. Therefore, functional characterization of proteins is crucial to understanding life and disease and to developing novel treatments for life-threatening illnesses. Despite recent advancements, predicting protein function remains an open problem due to low predictive performance, lack of explainable outcomes, and irreproducible research dissemination, highlighting the need for improved methodologies that leverage the recent proliferation of biomedical data about proteins. The objective of this research is to design, implement, and evaluate a protein function prediction pipeline using a novel generative deep learning approach powered by a heterogeneous knowledge graph to address the challenges of multi-omics data integration, explainable function prediction, and reproducibility. The research will be carried out through three interrelated tasks: 1) investigation of a novel generative deep learning model on a knowledge graph; 2) integration of multi-omics features through a large language model; and 3) development of reproducible software. Successful completion of this project will lead to a robust, more accurate, reproducible, and explainable protein function prediction pipeline. The project will create new education and outreach opportunities to greatly strengthen training and research activities in computational biology leveraging modern AI technologies at Meharry Medical College, a leading HBCU. Meharry predominantly enrolls African American students. More than 90% of data science students at Meharry are African American, and the majority are women. 
This project will increase STEM education awareness, impact, and opportunity for women and minority students at Meharry to excel in AI/ML, quantitative genomics, and data science research. The reproducible open-source software will greatly benefit the broader scientific community working to improve protein function prediction.

Develop the SDoH NLP pipeline that will mine and extract SDoH factors related to SUDs 


Funding:  RCMI Supplement; 2023-2024

Members and Affiliates: 

Brief Description

Substance use disorders (SUDs) are a major public health issue whose prevalence among Americans has more than doubled in the last few years, from 20 million affected in 2018 to 46.3 million in 2021. SUD is associated with myriad poor health outcomes and comorbidities. Those affected by SUD disproportionately experience negative social determinants of health (SDoH), including inadequate access to safe housing, transportation, education, employment opportunities, and nutritious foods. These SDoH are connected with low self-esteem, low self-efficacy, and failed attempts at abstinence. Collecting and integrating SDoH information as part of patients’ electronic health records (EHR) for clinical modeling could help uncover patient experiences and behaviors related to SDoH; this detection would help inform clinical care and potentially reduce health disparities among persons with SUD. However, because the majority of SDoH information is embedded in unstructured free text, patients with SUDs are significantly under-detected in EHRs. Additionally, SUD diagnoses are underrepresented in structured ICD-10 coding. Therefore, an automated and more accurate approach is needed to extract SDoH and identify SUDs. Natural language processing (NLP) can unlock the information conveyed in clinical narratives, thus playing a critical role in real-world studies. NLP is a sub-domain of data science that includes tools and techniques for unstructured text data analytics (extraction, retrieval, and modeling) by leveraging advances in machine learning. Methods and tools are being developed to facilitate such extraction. However, these tools are still under study for cohort-specific samples, and unstructured text analytics is complex. 
In this proposal, we aim to address the challenge of SDoH extraction and SUD identification from unstructured clinical notes or patient surveys to generate a consistent framework that can aid in identifying, understanding, treating, and predicting SUDs and associated outcomes (e.g., relapse). Meharry Medical College provides healthcare services for underserved populations, and Elam Mental Health Center (EMHC) provides a residential treatment program for individuals with co-occurring mental health disorders and SUDs. To advance health disparity studies and improve understanding of the characteristics of SUD patients from underserved populations, it is a high priority to explore machine learning tools to extract SDoH-related factors and identify SUD diagnoses.