Tutorial - W18
Applying Generative Language Models for Biological Sequence Analysis
Overview
This 3-hour tutorial offers an intensive training session on the use of generative natural language processing (NLP) techniques, such as transformer-based large language models, for biological sequence analysis, combining theoretical understanding with practical exercises. The session focuses on advances in generative language models: deep learning models adept at reading, summarizing, translating, and generating text, whose methods carry over by analogy to biological sequences. Using models such as ProtTrans, which was trained on a vast repository of UniProtKB protein sequences, the tutorial highlights these models' ability to capture the intricate spatial relationships among protein residues, which is particularly valuable for answering biological questions about protein structure and function.
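As a concrete illustration (not part of the tutorial materials), the following minimal sketch extracts per-residue embeddings from a protein sequence using the publicly available Rostlab/prot_bert ProtTrans checkpoint via the Transformers library; the example sequence is an arbitrary placeholder.

```python
# Minimal sketch: per-residue embeddings from a ProtTrans checkpoint.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

# ProtBert expects amino acids separated by spaces, with rare residues mapped to X.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # placeholder sequence
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch, tokens, hidden); positions 0 and -1
# are the special [CLS]/[SEP] tokens, so slice them off.
residue_embeddings = outputs.last_hidden_state[0, 1:-1]
print(residue_embeddings.shape)  # (sequence length, 1024)
```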
Participants will learn to construct basic machine learning pipelines that leverage deep learning and pre-trained transformer models to probe biological sequences. Starting with an overview of the essential Python packages PyTorch, Scikit-Learn, and Transformers, the tutorial proceeds to the biological underpinnings of protein sequences and their functions. It then reviews classical NLP and its latest breakthroughs before exploring how the self-supervised attention mechanisms of transformers can be repurposed for understanding protein sequences. The tutorial aims to impart practical skills and insights for applying cutting-edge models in biomedicine, so that attendees leave with a robust framework for analyzing protein sequences using the latest AI/ML technologies. The hands-on exercises are also directly transferable to other clinical informatics tasks, such as clinical text mining.
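The sketch below gives a hedged preview of such a pipeline, assuming the same Rostlab/prot_bert checkpoint as above: Transformers and PyTorch vectorize each sequence into a fixed-size embedding, and Scikit-Learn trains a classifier on top. The sequences, labels, and the embed helper are illustrative placeholders rather than tutorial data.

```python
# Sketch of an embeddings-to-classifier pipeline with placeholder data.
import re
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert").eval()

def embed(seq: str) -> np.ndarray:
    """Return a fixed-size vector: the mean of per-residue ProtBert embeddings."""
    seq = " ".join(re.sub(r"[UZOB]", "X", seq))
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0, 1:-1]  # drop [CLS]/[SEP]
    return hidden.mean(dim=0).numpy()

# Toy binary task: placeholder sequences and labels, not tutorial data.
seqs = ["MKTAYIAKQR", "GAVLIPFWMG", "MKKLLPTAAA", "GGSGGSGGSG"]
labels = [1, 0, 1, 0]

X = np.stack([embed(s) for s in seqs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```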
References
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Elnaggar, A., Heinzinger, M., Dallago, C., Rehawi, G., Yu, W., Jones, L., Gibbs, T., Feher, T., Angerer, C., Steinegger, M., & Bhowmik, D. (2021). ProtTrans: Towards cracking the language of life's code through self-supervised deep learning and high-performance computing. IEEE Transactions on Pattern Analysis and Machine Intelligence.
UniProt Consortium. (2008). The universal protein resource (UniProt). Nucleic Acids Research, 36(suppl_1), D190-D195.
Learning Objectives
Educational Objectives: By the end of the tutorial, participants will:
Learn the basics and applications of language models in bioinformatics.
Implement a Python pipeline for collecting, preprocessing, and vectorizing biological sequence data for analysis.
Implement neural sequence models from NLP, such as RNNs and LSTMs, for biological sequence analysis (a minimal PyTorch sketch follows this list).
Apply pre-trained transformer models to specific protein sequence analysis tasks.
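As referenced in the objectives above, the following self-contained PyTorch sketch shows one plausible shape of such an RNN/LSTM model; the amino acid vocabulary, encode helper, and hyperparameters are illustrative assumptions, not the tutorial's notebook code.

```python
# Minimal LSTM classifier over index-encoded amino acid sequences.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard residues + X for unknowns
AA_TO_IDX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 = padding

def encode(seq: str, max_len: int = 50) -> torch.Tensor:
    """Map a protein sequence to a fixed-length tensor of residue indices."""
    idx = [AA_TO_IDX.get(aa, AA_TO_IDX["X"]) for aa in seq[:max_len]]
    idx += [0] * (max_len - len(idx))  # right-pad with the padding index
    return torch.tensor(idx)

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size=len(AMINO_ACIDS) + 1, embed_dim=32,
                 hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        emb = self.embedding(x)        # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(emb)   # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])        # logits: (batch, num_classes)

# Toy usage: one batch of two placeholder sequences.
batch = torch.stack([encode("MKTAYIAKQR"), encode("GAVLIPFWMG")])
model = LSTMClassifier()
logits = model(batch)
print(logits.shape)  # torch.Size([2, 2])
```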
Target Audience
The target audience includes graduate students, researchers, scientists, and practitioners in academia and industry who are interested in applications of deep learning, natural language processing, and transformer-based language models to biomedicine and biomedical knowledge discovery. The tutorial is aimed at entry-level participants with basic knowledge of computer programming (preferably Python) and machine learning (beginner or intermediate).
Instructions
Download the folder shared at this link: https://tinyurl.com/yeyyfskh
The required notebooks and data are inside the folder.
Outline
SATURDAY - NOV 9, 2024 (1:00 - 4:30 PM EDT)
Sessions:
Session 1: Warm-up and Refresher
Session 2: General Introduction to Language Models
Session 3: Unboxing Generative Language Models
Session 4: Hands-on Case Studies on Applications of Transformers in Protein Sequence Analysis
Materials:
Slides: Sarker-W18
Notebooks:
Additional Background:
Python Refresher
Complementary Notebook 0: Colab-Notebook-Python-Refresher
Building deep learning models (RNN, LSTM) for sequence analysis
Complementary Notebook 1: Colab-Notebook-RNN
Complementary Notebook 2: Colab-Notebook-LSTM
Feedback
We greatly appreciate your feedback.
Organizers:
1. Bishnu Sarker
Bishnu Sarker is an Assistant Professor of Computer Science and Data Science at Meharry Medical College, Nashville, TN, USA. His research focuses on applying AI, deep learning, natural language processing (NLP), and graph-based reasoning to describe proteins numerically and to infer their functional characteristics from complex, heterogeneous, and interconnected biomedical data. He received his BS from Khulna University of Engineering and Technology, Bangladesh; his MS from Sorbonne University, France; and his PhD from INRIA, France. During his PhD he spent a winter at Mila - Quebec AI Institute and the University of Montreal, Canada, as a visiting researcher supported by the DrEAM mobility grant from the University of Lorraine.
2. Sayane Shome
Sayane Shome is a postdoctoral researcher at Stanford University, USA, jointly appointed in Dr. Nima Aghaeepour's and Dr. Lawrence S. Prince's labs. She completed her PhD in Bioinformatics at Iowa State University, USA, under the supervision of Dr. Robert L. Jernigan; her dissertation studied various membrane protein systems using computational biophysical methods. In her postdoctoral fellowship at Stanford, she works on computational analysis of different omics datasets and develops AI models from biomedical data focused on understanding the progression of bronchopulmonary dysplasia in newborns. She also has strong interests in outreach, science communication, and STEM awareness among underrepresented groups, has volunteered with various non-profit organizations, and has held leadership roles in the International Society for Computational Biology (ISCB) Student Council for several years.
Acknowledgment:
This material is based upon work supported by the National Science Foundation under Grant No. 2302637. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.