Artificial intelligence for synthetic organic and analytical chemistry
Abstract:
Artificial intelligence and machine learning have become important components of the computational toolbox that can be used to advance chemical research and discovery. In this talk, I will discuss our group’s work advancing AI/ML as it applies to the broad subfields of synthetic organic chemistry and analytical chemistry. I will describe several approaches to facilitate decision-making during synthesis planning and reaction development, including the long-standing task of computer-aided retrosynthetic analysis. Though most research in “predictive chemistry” focuses on applying known reactivity to new substrates, ongoing work has also started to show promise for reaction discovery. I will also describe our recent work in analytical chemistry, specifically using tandem mass spectrometry data for structure elucidation of unknown small molecule metabolites. A pervasive theme of our research is the use of domain expertise to inform modeling, from formulating chemistry challenges as statistical learning problems to designing new neural network architectures uniquely suited to chemistry data.
Bio:
Connor W. Coley is the Class of 1957 Career Development Professor and an Assistant Professor at MIT in the Department of Chemical Engineering and the Department of Electrical Engineering and Computer Science. He received his B.S. and Ph.D. in Chemical Engineering from Caltech and MIT, respectively, and did his postdoctoral training at the Broad Institute. His research group at MIT works at the interface of chemistry and data science to develop models that understand how molecules behave, interact, and react and use that knowledge to engineer new ones, with an emphasis on therapeutic discovery. Connor is a recipient of C&EN’s “Talented Twelve” award, Forbes Magazine’s “30 Under 30” for Healthcare, Technology Review’s 35 Innovators Under 35, the NSF CAREER award, the ACS COMP OpenEye Outstanding Junior Faculty Award, the Bayer Early Excellence in Science Award, the 3M NTFA, and was named a Schmidt AI2050 Early Career Fellow and a 2023 Samsung AI Researcher of the Year.
Summary
Focus: Small organic molecules (useful and versatile)
Challenge: Chemical space is vast
Molecular discovery: complex multi-objective optimization
Typically driven by human intuition
Tasks:
Predicting chemical properties, including reactivity
Ideating new molecular structures
Balancing objectives
Research threads:
AI for synthetic organic chemistry, medicinal chemistry, analytical chemistry
Foundational capabilities: chemistry-tailored neural nets, data sharing, autonomous chemistry labs
History of key chemical tasks
Computer-aided retrosynthesis: Computer programs that explore recipes from expert-encoded rules/heuristics/constraints
Explain reactivity trends: data-driven analysis of relation between physical conditions and experimental outcomes
Predicting spectra (how molecules look to sensors): rule-based analyses
Synthesis planning: how we access (new) molecules
Input: product to synthesize
Output: reactants, intermediates, conditions
Typical approach: start with product and try to reverse it until we get to chemicals that we can purchase
Use libraries of valid chemical transformation rules
Produced by chemical vendors
Expert encoded rules in software (https://www.synthiaonline.com/)
Generative models that hypothesize possible transformations
Mine databases of historical reactions
Critical to create a canonical representation of molecules and reactions
Strings (e.g. SMILES)
Structural “fingerprints”
Descriptions of constituent molecules
Graphs & graph edits (requires atom mapping)
Condensed graph of reaction (requires atom mapping)
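To make the graph-edit representation listed above concrete, here is a minimal sketch that stores bonds as a dictionary keyed by pairs of atom-map numbers and derives a reaction's edits by comparing reactant and product bonds. The names `bond_key` and `graph_edits` and the toy substitution example (hydrogens omitted) are illustrative assumptions, not the speaker's code or any toolkit's API.

```python
from typing import Dict, List, Tuple

Bond = Tuple[int, int]  # pair of atom-map numbers, smaller number first


def bond_key(a: int, b: int) -> Bond:
    """Return an order-independent key for the bond between atoms a and b."""
    return (a, b) if a < b else (b, a)


def graph_edits(
    reactant_bonds: Dict[Bond, float], product_bonds: Dict[Bond, float]
) -> List[Tuple[int, int, float, float]]:
    """List (atom_a, atom_b, order_before, order_after) for every changed bond.

    A bond order of 0.0 means "no bond". This only works because both sides
    share the same atom-map numbers.
    """
    edits = []
    for key in sorted(set(reactant_bonds) | set(product_bonds)):
        before = reactant_bonds.get(key, 0.0)
        after = product_bonds.get(key, 0.0)
        if before != after:
            edits.append((key[0], key[1], before, after))
    return edits


# Toy example (assumption): bromomethane + hydroxide -> methanol + bromide.
atoms = {1: "C", 2: "Br", 3: "O"}
reactants = {bond_key(1, 2): 1.0}  # C-Br bond present before the reaction
products = {bond_key(1, 3): 1.0}   # C-O bond present after the reaction
edits = graph_edits(reactants, products)
print(edits)
```

The edit list records one broken C-Br bond and one formed C-O bond. Without a consistent atom mapping between the two sides, the keys would not line up, which is why the graph-based representations require it.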
Synthesis constrains the space of chemicals and transformations (and everything that influences them) that we can easily access
Every transformation requires environmental conditions (solvents, additives, concentrations, temperature, reaction time, etc.)
Different approaches focus on different levels of detail
Approach:
Learning transformation rules for reactions
From databases of known reactions
Representation: graphs of atoms + covalent bonds
Graph neural networks: learn about the behavior of each atom based on its connective structure
Limitations
No 3D structure (not that important for small molecules)
Ignore chirality of molecules
Some covalent bond details are not represented
Ignore interactions beyond covalent bonds
Ignore atropisomerism
Database learning process
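The idea that a graph neural network learns about each atom from its connective structure can be sketched with a single message-passing step. This is a simplified illustration under stated assumptions: it uses plain sum aggregation with no learned weights, and the function name `message_passing_step` and the two-element atom features are hypothetical.

```python
from typing import Dict, List


def message_passing_step(
    features: Dict[int, List[float]], neighbors: Dict[int, List[int]]
) -> Dict[int, List[float]]:
    """Update each atom's vector by adding the vectors of its bonded neighbors."""
    updated = {}
    for atom, feat in features.items():
        total = list(feat)
        for nb in neighbors.get(atom, []):
            total = [x + y for x, y in zip(total, features[nb])]
        updated[atom] = total
    return updated


# Heavy atoms of ethanol, C-C-O, with features [is_carbon, is_oxygen].
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}

step1 = message_passing_step(features, neighbors)
step2 = message_passing_step(step1, neighbors)
print(step1)
print(step2)
```

After one step the terminal carbon (atom 0) still carries no oxygen signal; after two steps it does, because information travels one bond per step. The inputs are only atoms and covalent bonds, so two molecules that differ solely in 3D arrangement or chirality would produce identical vectors here, which mirrors the limitations listed above.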
Find core transformation from database
Add related neighbor reactions
Use set of reactions to drive a retrosynthesis search
Neural net process
Train model to convert from products to reactants
SMILES->SMILES
Graph->Graph
Graph->SMILES
Apply repeatedly within a retrosynthesis search loop
Many algorithms to search very large space
Monte Carlo tree search, best-first, etc.
Reinforcement learning (RL) is usable but challenging because the space of moves is dynamic
Search is vulnerable to hallucination: incorrectly predicted transformations lead it down wrong paths
Simulations can be used to check these rules but they are not yet ready to be used reliably (hard to set up, computationally expensive, inaccurate)
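The search loop described above (apply a one-step model repeatedly until every remaining molecule is purchasable) can be sketched as a best-first search. Everything here is a toy assumption: molecules are plain labels, `ONE_STEP` is a hypothetical lookup table standing in for a trained product-to-reactants model, and `BUYABLE` stands in for a vendor catalog.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Step = Tuple[float, Tuple[str, ...]]         # (cost, precursors)
Route = List[Tuple[str, Tuple[str, ...]]]    # [(molecule, chosen precursors)]

# Hypothetical one-step suggestions: product -> candidate disconnections.
ONE_STEP: Dict[str, List[Step]] = {
    "amide": [(1.0, ("acid", "amine")), (3.0, ("acyl_chloride", "amine"))],
    "acid": [(1.0, ("ester",))],
    "acyl_chloride": [(1.0, ("acid",))],
}
BUYABLE = {"amine", "ester"}  # hypothetical purchasable starting materials


def best_first_retrosynthesis(
    target: str, max_expansions: int = 100
) -> Optional[Tuple[float, Route]]:
    """Expand the cheapest partial route first until all molecules are buyable."""
    counter = 0  # tie-breaker so the heap never compares routes directly
    frontier = [(0.0, counter, (target,), [])]
    while frontier and max_expansions > 0:
        cost, _, open_mols, route = heapq.heappop(frontier)
        unsolved = [m for m in open_mols if m not in BUYABLE]
        if not unsolved:
            return cost, route
        max_expansions -= 1
        mol, rest = unsolved[0], tuple(unsolved[1:])
        for step_cost, precursors in ONE_STEP.get(mol, []):
            counter += 1
            heapq.heappush(
                frontier,
                (cost + step_cost, counter, rest + precursors,
                 route + [(mol, precursors)]),
            )
    return None  # no route found within the expansion budget


result = best_first_retrosynthesis("amide")
print(result)
```

The search trusts every entry in `ONE_STEP` without checking it, so a single wrong suggestion would be followed just as readily as a correct one. That is the hallucination risk noted above, and it is why an independent check on proposed steps would be valuable.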
Reaction condition recommendation as fill-in-the-blank
Embedding model: learns vector embedding of reagents based on their function. Groups them in ways that mimic structural relationships.
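The fill-in-the-blank framing can be illustrated with a frequency lookup: given a reaction record with one empty slot, suggest the value seen most often in records that agree on the filled slots. This is a deliberately simple stand-in for the learned embedding model, not a description of it. The records in `TOY_RECORDS` are invented for illustration and `fill_blank` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Invented toy records (assumption), not taken from any real dataset.
TOY_RECORDS: List[Record] = [
    {"reaction": "suzuki", "base": "K2CO3", "solvent": "dioxane"},
    {"reaction": "suzuki", "base": "K2CO3", "solvent": "toluene"},
    {"reaction": "suzuki", "base": "Cs2CO3", "solvent": "dioxane"},
    {"reaction": "amide_coupling", "base": "DIPEA", "solvent": "DMF"},
    {"reaction": "amide_coupling", "base": "DIPEA", "solvent": "DCM"},
]


def fill_blank(query: Record, records: List[Record]) -> Optional[str]:
    """Suggest a value for the single slot in `query` that is set to None."""
    blanks = [slot for slot, value in query.items() if value is None]
    if len(blanks) != 1:
        raise ValueError("query must contain exactly one blank slot")
    blank = blanks[0]
    known = {slot: value for slot, value in query.items() if value is not None}
    counts = Counter(
        r[blank]
        for r in records
        if all(r.get(slot) == value for slot, value in known.items())
    )
    return counts.most_common(1)[0][0] if counts else None


suggestion = fill_blank(
    {"reaction": "suzuki", "base": None, "solvent": "toluene"}, TOY_RECORDS
)
print(suggestion)
```

A lookup like this cannot generalize to contexts it has never seen, and it returns `None` when nothing matches. A learned embedding that places functionally similar reagents near one another is one way to generalize beyond exact matches.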
ASKCOS: https://askcos.mit.edu/
Suite of synthesis planning modules
Chemoinformatics & ML
Tasks: Retrosynthesis, condition recommendation, reaction product prediction, reaction classification, atom mapping, selectivity prediction, solvation prediction
35k users, 15 companies
Challenges:
Complexity of synthetic targets is changing: more complex molecules and synthesis pathways are needed for modern use-cases
Data-driven search programs generate many pathway ideas but what do we do experimentally?
Need to score options to identify the most promising ones
E.g. feasibility, impurity, greenness, yield, flow compatibility, scalability, cost
The source of the reaction data affects the ability to evaluate reactivity
Diverse in substrates/reactions but not conditions: papers, patents
Diverse in conditions but not in substrates/reactions: high-throughput experimentation
Need techniques that cover both
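The need to score candidate pathways against several criteria, raised in the challenges above, can be sketched as a weighted sum followed by a ranking. The criteria subset, the weights, and the per-route numbers below are all invented for illustration, and a weighted sum is only one of many possible ways to balance objectives.

```python
from typing import Dict, List

# Hypothetical weights: positive rewards a criterion, negative penalizes it.
WEIGHTS: Dict[str, float] = {"feasibility": 0.5, "yield": 0.3, "cost": -0.2}

# Invented scores on a 0-1 scale for three candidate routes.
ROUTES: Dict[str, Dict[str, float]] = {
    "route_A": {"feasibility": 0.9, "yield": 0.5, "cost": 0.2},
    "route_B": {"feasibility": 0.6, "yield": 0.8, "cost": 0.1},
    "route_C": {"feasibility": 0.3, "yield": 0.9, "cost": 0.5},
}


def route_score(criteria: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-criterion values into one number with a weighted sum."""
    return sum(weights[name] * criteria.get(name, 0.0) for name in weights)


def rank_routes(
    routes: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> List[str]:
    """Return route names ordered from highest to lowest combined score."""
    return sorted(routes, key=lambda name: route_score(routes[name], weights),
                  reverse=True)


ranked = rank_routes(ROUTES, WEIGHTS)
print(ranked)
```

Changing the weights changes the ranking, so the choice of weights encodes a judgment about which objectives matter most for a given project.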
Open Reaction Database: encourages data sharing across teams
https://open-reaction-database.org/
Database
Data structure
Different organization approach for community
Summary:
AI/ML has broad relevance for chemistry
Old goals, ongoing new approaches
ASKCOS suite of tools
Overall, great opportunity for supervised learning