Gavin Ye
Class of 2025
Generative machine learning models have enormous potential for drug discovery. These models learn from training data and generate new data that resembles it. Although the "language" of chemistry is extremely complex, chemical molecules have a straightforward text representation, which lets researchers design drug molecules with generative language processing models.
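To make the idea of molecules as text concrete, here is a minimal sketch of how a molecule string could be split into tokens and mapped to integer ids, the form a language model consumes. It assumes the representation is a SMILES-style line notation, which the text above does not name; the token pattern and the helper names `tokenize`, `build_vocab`, and `encode` are hypothetical and chosen only for illustration.

```python
import re

# Assumption: molecules are written as SMILES-style strings.
# Hypothetical token pattern: bracketed atoms such as [NH3+], the
# two-letter halogens Br and Cl, two-digit ring labels such as %12,
# and otherwise one character per token.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES-style string into a list of string tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(corpus):
    """Map every distinct token in the corpus to an integer id."""
    tokens = sorted({tok for s in corpus for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def encode(smiles, vocab):
    """Turn a molecule string into the id sequence a model would see."""
    return [vocab[tok] for tok in tokenize(smiles)]


# Aspirin written as a SMILES string.
ASPIRIN = "CC(=O)Oc1ccccc1C(=O)O"

if __name__ == "__main__":
    vocab = build_vocab([ASPIRIN, "C[NH3+]", "CCCl"])
    print(tokenize(ASPIRIN))
    print(encode(ASPIRIN, vocab))
```

Because every token is a substring of the input, joining the tokens gives back the original string, so nothing about the molecule is lost in this step. A real pipeline would likely use a chemistry toolkit to check that generated strings correspond to valid molecules, which this sketch does not attempt.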
With the help of generative machine learning, scientists could discover drugs and vaccines much more efficiently and possibly even prevent future pandemics by developing a vaccine beforehand. Machine-aided drug discovery shortens the time between the discovery of a new virus and the development of a treatment for it. However, one of the main challenges is building a drug-design model that has all of the characteristics it needs to perform well: the ability to build accurate molecules, to generate molecules that meet multiple constraints, and to generate practical molecules that can actually be synthesized, since a molecule can be theoretically possible yet currently impossible to synthesize. Existing drug-design models usually fall short in one of two ways: either they support optimization for multiple traits but generate molecules too complex to be synthesized, or they generate synthesizable molecules that satisfy few of the desired traits.
Various language processing models have been tried for designing drug molecules. In one study, an older GPT (Generative Pretrained Transformer) model was trained and optimized for drug design. That model outperformed many existing models and reached state-of-the-art accuracy. This study inspired me to examine the potential of recently developed generative models. The recent success of OpenAI's GPT-4 model and ChatGPT suggests that similar GPT models have large potential in drug discovery. OpenAI's pretrained GPT models can be fine-tuned: supplying training sets and parameters specializes the model for particular tasks. My project will integrate the newest GPT-4 model with other existing machine learning models, such as a drug-synthesis planning model, to generate novel molecules with high accuracy that are also synthetically feasible. Next, I will apply the optimized model to biological targets related to Alzheimer's disease.