Gavin Ye
Class of 2024
1st place in Mathematics & Computer Science, New York-Metro Junior Science and Humanities Symposium Semifinals ‘24
5th place, NYC JSHS Regional Finals; part of the 5-person NYC delegation to the 62nd National JSHS
1st place in Software & Robotics, Terra NYC STEM Fair ‘24 Finals; one of 13 projects from NYC to advance to ISEF Finals
2nd place in Computational Biology and Bioinformatics, International Science and Engineering Fair 2024
Paper published in the peer-reviewed Journal of Computer-Aided Molecular Design
Drug discovery is one of the most time-consuming and costly stages of developing a drug: it is estimated to take about 10–15 years and cost $1.4 billion per drug discovered and approved. The space of synthesizable molecules is far too vast to enumerate and test exhaustively for potential drug effectiveness. Machine learning (ML) has emerged as one of the most promising tools for speeding up this process. Because molecules can be written in “languages” that an algorithm can interpret, language-processing ML models such as GPT models can be retooled for drug design. Traditional (non-large-language) models for drug design from previous studies often generate invalid representations that do not correspond to actual molecules (like ChatGPT producing gibberish). Thus, my goal was to use GPT to design drug candidates that are both highly effective against a specified drug target and chemically valid.
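To illustrate how a molecule becomes a “language” a model can read: molecules are commonly written as SMILES strings, which can be split into tokens for a GPT-style model. The sketch below is a deliberately simplified, hypothetical tokenizer (not the one used in this project); real pipelines handle many more SMILES features.

```python
import re

# Simplified SMILES token pattern (illustrative only). Multi-character
# tokens like "Cl", "Br", and bracketed atoms must be matched before
# single-character atoms so "Cl" is not split into "C" + "l".
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[0-9]|[=#\-\+\(\)/\\%])"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens a language model can consume."""
    tokens = SMILES_TOKEN.findall(smiles)
    # A lossless tokenization must reconstruct the original string;
    # otherwise the input contained characters this toy pattern misses.
    assert "".join(tokens) == smiles, f"untokenizable characters in {smiles!r}"
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
```

An invalid generated string is one that tokenizes fine but does not correspond to a real molecule; checking that in practice requires a chemistry toolkit such as RDKit.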
Before I could train a GPT model to generate effective molecules, I needed a way to quickly evaluate the drug effectiveness (efficacy) of any molecule, as it would be impractical to synthesize every machine-designed molecule during training. Thus, I designed my own efficacy evaluation model. Then, I trained my GPT drug design model to design drug-like molecules similar to the human-designed candidates in the dataset. Finally, I used the trained efficacy evaluation model to optimize my GPT drug design model, via reinforcement learning, toward designing higher-efficacy molecules. I used the amyloid precursor protein (APP), a promising drug target for Alzheimer’s disease, as a case study; the same methodology can be applied to transfer my GPT drug design model to a different target protein using a different dataset.
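The reinforcement-learning step in the pipeline above can be sketched as reward-weighted feedback: sample a batch of molecules from the GPT, score each with the efficacy evaluation model, and subtract the batch mean so above-average molecules are reinforced. This is a minimal stand-alone sketch, not the project’s actual code; `predicted_efficacy` is a hypothetical placeholder for the trained efficacy model, and real training would weight each molecule’s log-probability under the GPT by its advantage (e.g., with PyTorch).

```python
def predicted_efficacy(smiles: str) -> float:
    # Hypothetical, deterministic stand-in for the trained efficacy
    # evaluation model; a real score would come from the neural network.
    return (sum(map(ord, smiles)) % 100) / 100.0

def reinforce_rewards(batch: list[str]) -> list[float]:
    """Mean-baseline advantages for a REINFORCE-style update.

    Subtracting the batch-mean score reduces gradient variance:
    molecules scoring above average get positive advantages (their
    generation probability is increased), below-average ones negative.
    """
    scores = [predicted_efficacy(s) for s in batch]
    baseline = sum(scores) / len(scores)
    return [s - baseline for s in scores]

advantages = reinforce_rewards(["CCO", "c1ccccc1", "CC(=O)O"])
# Advantages sum to ~0 by construction.
```

The key design point is that the efficacy model only needs to produce a score, not gradients, which is what makes reinforcement learning a fit when the molecule-to-score path is not differentiable.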
My drug efficacy evaluation model is 2.3 times more accurate and more data-efficient than models from previous studies. Almost all (99.2%) of the molecules designed by my drug design model are highly effective, and nearly all are more effective than the average molecule in the training dataset. In addition, every designed molecule is chemically valid and novel: none of them exist in the dataset. Future studies can take inspiration from OpenAI’s methods and have human chemists provide feedback directly to the drug design model. The GPT-designed drug candidates also exhibit properties similar to those in the dataset, so future studies could leverage this to make patented drugs more accessible by generating similar molecules. Together, these results and research extensions have the potential to transform drug discovery.
Most difficult part of your research project?
Designing and implementing the drug efficacy optimization process in a short period of time (in August, before school started). My original idea was to preserve differentiability from the designed molecules all the way through to the efficacy score. However, this was not possible: the efficacy evaluation model takes a different, combined representation that includes a molecule’s calculated chemical properties, and conversions between representations are not differentiable operations. In other words, I had to delete my old implementation midway through and re-implement a new one.
What influence did the older ASR classmates have on you?
During last year’s symposium, Raihana (ASR class of '22) suggested trying to add explainability to the drug-synthesis model (my proposed methodology at the time was slightly different). Although I did not implement a drug-synthesis model in the end, I found a way to add explainability to the drug effectiveness evaluation model. To the best of my and my mentor’s knowledge, this had never been done before, especially for a neural-network drug effectiveness evaluation model. (Thanks, Raihana!)
Last year I spent a lot of time talking with Lucas (ASR class of '23) and creating manim (Python) math animations together. This inspired how I explain and present equations on my slides.
Funny anecdote from ASR
Being the first team to escape the Escape Room last school year. We were “trapped” in a biology lab, and Ms. Bertram, our biology teacher, was on our team. During the game, she kept teaching us what each tool was: "Here’s the centrifuge! That's the incubator!!"
What's a misconception that people have about ASR or ASR students, and what's the truth?
It's that ASR students get an insurmountable amount of homework. Research does take a lot of time, but in reality it's often not traditional “homework” assigned by a teacher or mentor; it's self-motivated goals that the students set themselves. Of course, there are still a few very intimidating, important assignments with deadlines, but the motivation and agency come from each ASR student. This flexibility and agency, combined with effective planning (which Mr. Yashin teaches), minimize the stress or dread one typically feels when completing typical “homework” or assignments. And according to some ASR alumni, these research and productivity skills remain extremely useful even in college.
It does take that much time to do science research, but it also takes a lot of time to get good at golf, video games, or any other skill. Like those, research time is intrinsically motivated and should be enjoyable. For instance, although my last summer was filled with research methodology work, I enjoyed the experience of programming and training my own GPT model. It wasn't just some typical assigned "homework" that people do to get it over with.