Abstract:
AI-accelerated synthesis is an emerging field that uses machine learning algorithms to improve the efficiency and productivity of chemical and materials synthesis. Modern machine learning models, such as (large) language models, can capture the knowledge hidden in large chemical databases to rapidly design and discover new compounds, predict the outcome of reactions, and help optimize chemical reactions. One of the key advantages of AI-accelerated synthesis is its ability to make vast chemical data accessible and predict promising candidate synthesis paths, potentially leading to breakthrough discoveries. Overall, AI is poised to revolutionize the field of organic synthesis, enabling faster and more efficient drug development, catalysis, and other applications.
Bio:
Philippe Schwaller joined EPFL as a tenure-track assistant professor in the Institute of Chemical Sciences and Engineering in February 2022. He leads the Laboratory of Artificial Chemical Intelligence, which works on AI-accelerated discovery and synthesis of molecules. Philippe is a core PI of the NCCR Catalysis, a Swiss centre for sustainable chemistry research, education, and innovation, and a co-lead of the foundation models for sciences pillar in the Swiss AI initiative. He belongs to a new generation of scientists with a broad set of skills – in his case, a combination of chemistry, materials science, computer science, and experimental research.
Before EPFL, Philippe worked for five years at IBM Research. He simultaneously completed an MPhil in Physics (University of Cambridge) and a PhD in Chemistry and Molecular Sciences (University of Bern). He also holds a BSc and MSc degree in Materials Science and Engineering (EPFL).
Summary:
Focus: accelerating the molecule/materials design cycle
Design: what molecule to make?
How to make it
Test
Chemical data sources
Need: chemical reaction space (how to make molecules)
Published literature: extensive but not readily machine-accessible
Digital lab notebooks from experiments
Simulations (highly usable but limited to the types of reactions each model can support)
Patents (valuable but contain errors)
Daniel Lowe and Roger Sayle have text-mined reactions from patents
SMILES: linear representation of molecular graphs (a spanning tree written as text, with ring-closure digits acting as back-edges to form cycles)
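A minimal sketch of parsing a SMILES string into a molecular graph, assuming RDKit as the toolkit (the notes do not prescribe one):

```python
# Minimal sketch: SMILES -> molecular graph with RDKit (assumed toolkit).
from rdkit import Chem

smiles = "c1ccc(O)cc1"            # phenol: the "1" digits are the ring-closure back-edge
mol = Chem.MolFromSmiles(smiles)

print(mol.GetNumAtoms())          # 7 heavy atoms (6 aromatic C + 1 O)
print(Chem.MolToSmiles(mol))      # canonical form, e.g. "Oc1ccccc1"
for bond in mol.GetBonds():       # edges of the molecular graph
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondTypeAsDouble())
```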
ML for reaction predictions:
MolecularTransformer can be used to map reaction precursors to products
https://github.com/pschwllr/MolecularTransformer
Currently, the best-performing approaches use transformers
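A rough sketch of how forward reaction prediction is framed as sequence-to-sequence translation over tokenized reaction SMILES; the tokenizer regex below follows the spirit of the Molecular Transformer preprocessing, though the exact pattern in the repository may differ:

```python
import re

# Regex-based SMILES tokenizer in the spirit of the Molecular Transformer
# preprocessing (the exact pattern in the repo may differ slightly).
SMI_REGEX = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"

def tokenize(smiles: str) -> str:
    return " ".join(re.findall(SMI_REGEX, smiles))

# Forward prediction as translation: precursors (source) -> product (target)
precursors = "CC(=O)Cl.OCc1ccccc1"   # acetyl chloride + benzyl alcohol
product = "CC(=O)OCc1ccccc1"         # benzyl acetate
print(tokenize(precursors))
print(tokenize(product))
```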
Retrosynthesis:
Target molecule
Known/available building blocks
Design sequence of reactions to produce target molecule
Typically done by specifying reaction rules and searching over the space to reach the target
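A minimal sketch of the search described above, where `single_step_retro` is a hypothetical stand-in for a reaction-rule library or a learned single-step retrosynthesis model:

```python
# Sketch: recursively disconnect the target until every leaf is a purchasable
# building block. `single_step_retro` is a hypothetical stand-in that returns
# candidate precursor sets for a molecule.
from typing import Callable

def solve(target: str,
          building_blocks: set[str],
          single_step_retro: Callable[[str], list[list[str]]],
          depth: int = 5) -> list[str] | None:
    """Return a flat list of molecules on a successful route, else None."""
    if target in building_blocks:
        return [target]
    if depth == 0:
        return None
    for precursors in single_step_retro(target):   # candidate disconnections
        route = [target]
        for p in precursors:
            sub = solve(p, building_blocks, single_step_retro, depth - 1)
            if sub is None:
                break
            route.extend(sub)
        else:                                       # all precursors solved
            return route
    return None
```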
ML: RoboRXN
Multi-step synthesis planning
Molecular Transformer used for both the retrosynthetic and forward-prediction steps
Transformer predicts entire recipe with all the actions (stir, filter, etc.) that one can give to a robotic platform
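A hypothetical illustration (not the actual RoboRXN schema) of what such a predicted action sequence might look like as structured data for a robotic platform:

```python
# Hypothetical recipe representation: each action is a named step with
# parameters a robotic platform could execute. Materials and amounts are
# illustrative only.
recipe = [
    {"action": "ADD",    "material": "benzyl alcohol",  "amount": "1.0 equiv"},
    {"action": "ADD",    "material": "acetyl chloride", "amount": "1.1 equiv"},
    {"action": "STIR",   "duration": "2 h", "temperature": "25 C"},
    {"action": "QUENCH", "material": "water"},
    {"action": "EXTRACT", "solvent": "ethyl acetate", "repetitions": 2},
    {"action": "CONCENTRATE"},
]
```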
Use of general-purpose LLMs for chemical tasks (the above were specialized models)
Moving from encoder-decoder to decoder-only GPT models
Many computational chemistry tools are available on GitHub but are hard to set up and use
Aim: bridge the gap between computational and experimental chemistry
Generic LLMs are bad at chemistry; ChemCrow extends them using chemical tools
https://github.com/ur-whitelab/chemcrow-public
LLM uses existing specialized tools to solve chemical problems (a minimal agent-loop sketch follows the examples below)
Example: automated synthesis
Plan and execute synthesis of an insect repellent
Find the chemical to synthesize
Generic name => SMILES => molecular graph
Run reaction planner to get recipe
Execute recipe on robot
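A small sketch of the generic-name-to-SMILES step using the public PubChem PUG REST service; DEET is used here only as an example insect repellent, since the notes do not name the actual target:

```python
# Sketch of "generic name => SMILES" via the PubChem PUG REST API.
import requests

def name_to_smiles(name: str) -> str:
    url = (f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           f"{name}/property/CanonicalSMILES/TXT")
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text.strip()

print(name_to_smiles("DEET"))   # e.g. "CCN(CC)C(=O)C1=CC=CC(=C1)C"
```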
Example: molecular discovery
Given experimental data describing a molecule's properties
Use ChemCrow to discover the molecule consistent with the data
Example: Safety tools
Interact with the tool to ask about the dangers of using various chemicals and the likely outcomes of usage scenarios
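A library-agnostic sketch of the tool-using agent loop mentioned above (not ChemCrow's actual implementation); `call_llm` and the tool entries are hypothetical stubs standing in for a chat LLM and real chemistry tools:

```python
# Generic tool-using agent loop in the spirit of ChemCrow (not its code).
# The LLM either emits a JSON tool call or a final answer in plain text.
import json

TOOLS = {
    # Hypothetical stubs; a real system wires in actual chemistry tools.
    "name_to_smiles": lambda name: "CCN(CC)C(=O)c1cccc(C)c1",
    "plan_synthesis": lambda smiles: ["step 1: ...", "step 2: ..."],
    "lookup_safety":  lambda smiles: "no hazard data found (stub)",
}

def run_agent(task: str, call_llm, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(history, tool_names=list(TOOLS))   # hypothetical LLM call
        history.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)            # {"tool": ..., "input": ...}
        except json.JSONDecodeError:
            return reply                        # no tool call => final answer
        result = TOOLS[call["tool"]](call["input"])
        history.append({"role": "tool", "content": str(result)})
    return "Step budget exhausted."
```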
Automated synthesis is not yet a solved problem
Supply chain/robotics challenges
Weak synthesis planning models
Real organic molecules are much more complex than what current planning tools can handle
Bayesian optimization for reactions
Working to figure out the right granularity for describing molecules (e.g., one-hot encodings, DFT descriptors)
BoChemian: uses LLM embeddings of the text describing reaction procedures
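A minimal Bayesian-optimization sketch over a pool of featurized reaction conditions (one-hot, DFT descriptors, or LLM text embeddings as in BoChemian); the GP surrogate and expected-improvement acquisition are standard choices, and `run_reaction` is a hypothetical stand-in for the experiment:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best, xi=0.01):
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_loop(candidates: np.ndarray, run_reaction, n_init=5, n_iter=20):
    rng = np.random.default_rng(0)
    idx = list(rng.choice(len(candidates), n_init, replace=False))
    y = [run_reaction(candidates[i]) for i in idx]           # measured yields
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(candidates[idx], y)
        mu, sigma = gp.predict(candidates, return_std=True)
        ei = expected_improvement(mu, sigma, max(y))
        ei[idx] = -np.inf                                    # don't repeat experiments
        nxt = int(np.argmax(ei))
        idx.append(nxt)
        y.append(run_reaction(candidates[nxt]))
    return candidates[idx[int(np.argmax(y))]], max(y)        # best conditions, best yield
```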
Generative De Novo Molecule design
Distribution learning (transfer learning)
Goal-directed learning (reinforcement learning)
Generation using a high-fidelity oracle
Oracle: high fidelity/cost simulation
A protein design algorithm, for instance, can call the oracle only a limited number of times
Sample efficiency is critical: learn from few observations
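A sketch of goal-directed generation under an oracle-call budget; `generator` and `oracle` are hypothetical stand-ins for the generative model and the high-fidelity simulation:

```python
# Sketch: propose cheaply, score expensively within a fixed oracle budget,
# and update the generator from the few scored observations.
def optimize(generator, oracle, budget: int = 100, batch: int = 10):
    scored = []                                  # (smiles, score) observations
    while budget > 0:
        smiles = generator.sample(batch)         # cheap: propose candidates
        scores = [oracle(s) for s in smiles]     # expensive: limited calls
        budget -= len(smiles)
        scored.extend(zip(smiles, scores))
        generator.update(smiles, scores)         # reinforce high-scoring molecules
    return max(scored, key=lambda t: t[1])
```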
Approaches:
Augmented memory: combines data augmentation with experience replay
Saturn: sample-efficient de novo design
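A sketch of the data side of augmented memory, assuming RDKit's randomized SMILES output for the augmentation part; the policy update itself is omitted:

```python
# Sketch: keep a replay buffer of the top-scoring molecules and enlarge each
# replayed SMILES with randomized (non-canonical) SMILES before reuse.
from rdkit import Chem

def augment(smiles: str, n: int = 4) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    return [Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)]

class ReplayBuffer:
    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self.items: list[tuple[float, str]] = []   # (score, smiles)

    def add(self, smiles: str, score: float):
        self.items.append((score, smiles))
        self.items = sorted(self.items, reverse=True)[: self.capacity]

    def replay(self, k: int = 10) -> list[tuple[str, float]]:
        out = []
        for score, smi in self.items[:k]:
            for aug in augment(smi):               # augmentation multiplies reuse
                out.append((aug, score))
        return out
```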
Synthesizability-constrained generation
TANGO: enforcing building blocks in synthesis routes
New reward function
Tanimoto similarity
Substructure match
Accelerates search for high-value molecules
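A sketch of a building-block reward combining Tanimoto similarity on Morgan fingerprints with a substructure match, in the spirit of the reward described above; the exact weighting is an illustrative assumption, not the published TANGO reward:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def building_block_reward(candidate_smiles: str, block_smiles: str) -> float:
    cand = Chem.MolFromSmiles(candidate_smiles)
    block = Chem.MolFromSmiles(block_smiles)
    if cand is None or block is None:
        return 0.0
    fp_c = AllChem.GetMorganFingerprintAsBitVect(cand, 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(block, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(fp_c, fp_b)
    has_block = cand.HasSubstructMatch(block)   # building block fully contained
    return 1.0 if has_block else sim            # full reward once the block appears
```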
FSscore: Chemist’s personalized feasibility score
Different chemists find it easier to synthesize different molecules
Can fine-tune model to align with chemist preferences
Can replace human expert with a make-on-demand molecule library
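A sketch of aligning a feasibility score with pairwise chemist preferences via a Bradley-Terry style ranking loss (in the spirit of FSscore fine-tuning, not its actual code); `score_model` and the placeholder fingerprint features are hypothetical:

```python
# Sketch: push the score of the molecule judged easier to synthesize above
# the harder one. Labels can come from a chemist or, as noted above, from
# membership in a make-on-demand library.
import torch

def preference_loss(score_model: torch.nn.Module,
                    easier: torch.Tensor, harder: torch.Tensor) -> torch.Tensor:
    s_easy = score_model(easier).squeeze(-1)
    s_hard = score_model(harder).squeeze(-1)
    return -torch.nn.functional.logsigmoid(s_easy - s_hard).mean()

# Toy usage with placeholder fingerprint-sized features
model = torch.nn.Sequential(torch.nn.Linear(2048, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
easier, harder = torch.rand(32, 2048), torch.rand(32, 2048)
loss = preference_loss(model, easier, harder)
loss.backward()
opt.step()
```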