Abstract:
AI-accelerated synthesis is an emerging field that uses machine learning algorithms to improve the efficiency and productivity of chemical and materials synthesis. Modern machine learning models, such as (large) language models, can capture the knowledge hidden in large chemical databases to rapidly design and discover new compounds, predict the outcome of reactions, and help optimize chemical reactions. One of the key advantages of AI-accelerated synthesis is its ability to make vast chemical data accessible and predict promising candidate synthesis paths, potentially leading to breakthrough discoveries. Overall, AI is poised to revolutionize the field of organic synthesis, enabling faster and more efficient drug development, catalysis, and other applications.
Bio:
Philippe Schwaller joined EPFL as a tenure-track assistant professor in the Institute of Chemical Sciences and Engineering in February 2022. He leads the Laboratory of Artificial Chemical Intelligence, which works on AI-accelerated discovery and synthesis of molecules. Philippe is a core PI of the NCCR Catalysis, a Swiss centre for sustainable chemistry research, education, and innovation, and a co-lead of the foundation models for sciences pillar in the Swiss AI initiative. He belongs to a new generation of scientists with a broad set of skills – in his case, a combination of chemistry, materials science, computer science, and experimental research.
Before EPFL, Philippe worked for five years at IBM Research. He simultaneously completed an MPhil in Physics (University of Cambridge) and a PhD in Chemistry and Molecular Sciences (University of Bern). He also holds a BSc and MSc degree in Materials Science and Engineering (EPFL).
Summary:
Focus: accelerating the molecule/materials design cycle
Design: what molecule to make?
How to make it
Test
Chemical data sources
Need: chemical reaction space (how to make molecules)
Published literature: extensive but not readily machine-accessible
Digital lab notebooks from experiments
Simulations (highly usable but limited to the types of reactions each model can support)
Patents (valuable but contain errors)
Daniel Lowe and Roger Sayle have text-mined reactions from patents
SMILES: linear representation of molecular graphs (a spanning tree written as text, with ring-closure digits acting as back-edges to form cycles)
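A minimal sketch of parsing a SMILES string into a molecular graph, assuming RDKit as the toolkit (the notes do not prescribe one):

```python
# Minimal sketch: SMILES -> molecular graph with RDKit (assumed toolkit).
from rdkit import Chem

smiles = "c1ccc(O)cc1"            # phenol: the "1" digits are the ring-closure back-edge
mol = Chem.MolFromSmiles(smiles)

print(mol.GetNumAtoms())          # 7 heavy atoms (6 aromatic C + 1 O)
print(Chem.MolToSmiles(mol))      # canonical form, e.g. "Oc1ccccc1"
for bond in mol.GetBonds():       # edges of the molecular graph
    print(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), bond.GetBondTypeAsDouble())
```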
ML for reaction predictions:
MolecularTransformer can be used to map reaction precursors to products
https://github.com/pschwllr/MolecularTransformer
Currently, the best-performing approaches use transformers
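A rough sketch of how forward reaction prediction is framed as sequence-to-sequence translation over tokenized reaction SMILES; the tokenizer regex below follows the spirit of the Molecular Transformer preprocessing, though the exact pattern in the repository may differ:

```python
import re

# Regex-based SMILES tokenizer in the spirit of the Molecular Transformer
# preprocessing (the exact pattern in the repo may differ slightly).
SMI_REGEX = r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"

def tokenize(smiles: str) -> str:
    return " ".join(re.findall(SMI_REGEX, smiles))

# Forward prediction as translation: precursors (source) -> product (target)
precursors = "CC(=O)Cl.OCc1ccccc1"   # acetyl chloride + benzyl alcohol
product = "CC(=O)OCc1ccccc1"         # benzyl acetate
print(tokenize(precursors))
print(tokenize(product))
```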
Retrosynthesis:
Target molecule
Known/available building blocks
Design sequence of reactions to produce target molecule
Typically done by specifying reaction rules and searching over the space to reach the target
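A minimal sketch of the search described above, where `single_step_retro` is a hypothetical stand-in for a reaction-rule library or a learned single-step retrosynthesis model:

```python
# Sketch: recursively disconnect the target until every leaf is a purchasable
# building block. `single_step_retro` is a hypothetical stand-in that returns
# candidate precursor sets for a molecule.
from typing import Callable

def solve(target: str,
          building_blocks: set[str],
          single_step_retro: Callable[[str], list[list[str]]],
          depth: int = 5) -> list[str] | None:
    """Return a flat list of molecules on a successful route, else None."""
    if target in building_blocks:
        return [target]
    if depth == 0:
        return None
    for precursors in single_step_retro(target):   # candidate disconnections
        route = [target]
        for p in precursors:
            sub = solve(p, building_blocks, single_step_retro, depth - 1)
            if sub is None:
                break
            route.extend(sub)
        else:                                       # all precursors solved
            return route
    return None
```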
ML: RoboRXN
Multi-step synthesis planning
Molecular Transformer used for both the retrosynthetic and forward-prediction steps
Transformer predicts entire recipe with all the actions (stir, filter, etc.) that one can give to a robotic platform
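A hypothetical illustration (not the actual RoboRXN schema) of what such a predicted action sequence might look like as structured data for a robotic platform:

```python
# Hypothetical recipe representation: each action is a named step with
# parameters a robotic platform could execute. Materials and amounts are
# illustrative only.
recipe = [
    {"action": "ADD",    "material": "benzyl alcohol",  "amount": "1.0 equiv"},
    {"action": "ADD",    "material": "acetyl chloride", "amount": "1.1 equiv"},
    {"action": "STIR",   "duration": "2 h", "temperature": "25 C"},
    {"action": "QUENCH", "material": "water"},
    {"action": "EXTRACT", "solvent": "ethyl acetate", "repetitions": 2},
    {"action": "CONCENTRATE"},
]
```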
Use of general-purpose LLMs for chemical tasks (the above were specialized models)
Moving from encoder-decoder to decoder-only GPT models
Many computational chemistry tools are available on GitHub but are hard to set up and use
Aim: bridge the gap between computational and experimental chemistry
Generic LLMs are bad at chemistry; ChemCrow extends them using chemical tools
https://github.com/ur-whitelab/chemcrow-public
LLM uses existing specialized tools to solve chemical problems (a minimal agent-loop sketch follows the examples below)
Example: automated synthesis
Plan and execute synthesis of an insect repellent
Find the chemical to synthesize
Generic name => SMILES => molecular graph
Run reaction planner to get recipe
Execute recipe on robot
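A small sketch of the generic-name-to-SMILES step using the public PubChem PUG REST service; DEET is used here only as an example insect repellent, since the notes do not name the actual target:

```python
# Sketch of "generic name => SMILES" via the PubChem PUG REST API.
import requests

def name_to_smiles(name: str) -> str:
    url = (f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
           f"{name}/property/CanonicalSMILES/TXT")
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text.strip()

print(name_to_smiles("DEET"))   # e.g. "CCN(CC)C(=O)C1=CC=CC(=C1)C"
```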
Example: molecular discovery
Given experimental data describing a molecule's properties
Use ChemCrow to discover the molecule consistent with the data
Example: Safety tools
Interact with the tool to ask about the dangers of using various chemicals and the likely outcomes of usage scenarios
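A library-agnostic sketch of the tool-using agent loop mentioned above (not ChemCrow's actual implementation); `call_llm` and the tool entries are hypothetical stubs standing in for a chat LLM and real chemistry tools:

```python
# Generic tool-using agent loop in the spirit of ChemCrow (not its code).
# The LLM either emits a JSON tool call or a final answer in plain text.
import json

TOOLS = {
    # Hypothetical stubs; a real system wires in actual chemistry tools.
    "name_to_smiles": lambda name: "CCN(CC)C(=O)c1cccc(C)c1",
    "plan_synthesis": lambda smiles: ["step 1: ...", "step 2: ..."],
    "lookup_safety":  lambda smiles: "no hazard data found (stub)",
}

def run_agent(task: str, call_llm, max_steps: int = 10) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(history, tool_names=list(TOOLS))   # hypothetical LLM call
        history.append({"role": "assistant", "content": reply})
        try:
            call = json.loads(reply)            # {"tool": ..., "input": ...}
        except json.JSONDecodeError:
            return reply                        # no tool call => final answer
        result = TOOLS[call["tool"]](call["input"])
        history.append({"role": "tool", "content": str(result)})
    return "Step budget exhausted."
```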
Automated synthesis is not yet a solved problem
Supply chain/robotics challenges
Weak synthesis planning models
Real organic molecules are much more complex than what current planning tools can handle
Bayesian optimization for reactions
Working to figure out the right granularity for describing molecules (e.g., one-hot encodings, DFT descriptors)
BoChemian: uses LLM embeddings of the text describing reaction procedures
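A minimal Bayesian-optimization sketch over a pool of featurized reaction conditions (one-hot, DFT descriptors, or LLM text embeddings as in BoChemian); the GP surrogate and expected-improvement acquisition are standard choices, and `run_reaction` is a hypothetical stand-in for the experiment:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best, xi=0.01):
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def bo_loop(candidates: np.ndarray, run_reaction, n_init=5, n_iter=20):
    rng = np.random.default_rng(0)
    idx = list(rng.choice(len(candidates), n_init, replace=False))
    y = [run_reaction(candidates[i]) for i in idx]           # measured yields
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(normalize_y=True).fit(candidates[idx], y)
        mu, sigma = gp.predict(candidates, return_std=True)
        ei = expected_improvement(mu, sigma, max(y))
        ei[idx] = -np.inf                                    # don't repeat experiments
        nxt = int(np.argmax(ei))
        idx.append(nxt)
        y.append(run_reaction(candidates[nxt]))
    return candidates[idx[int(np.argmax(y))]], max(y)        # best conditions, best yield
```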
Generative De Novo Molecule design
Distribution learning (transfer learning)
Goal-directed learning (reinforcement learning)
Generation using a high-fidelity oracle
Oracle: high fidelity/cost simulation
A protein design algorithm, for instance, can call the oracle only a limited number of times
Sample efficiency is critical: learn from few observations
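A sketch of goal-directed generation under an oracle-call budget; `generator` and `oracle` are hypothetical stand-ins for the generative model and the high-fidelity simulation:

```python
# Sketch: propose cheaply, score expensively within a fixed oracle budget,
# and update the generator from the few scored observations.
def optimize(generator, oracle, budget: int = 100, batch: int = 10):
    scored = []                                  # (smiles, score) observations
    while budget > 0:
        smiles = generator.sample(batch)         # cheap: propose candidates
        scores = [oracle(s) for s in smiles]     # expensive: limited calls
        budget -= len(smiles)
        scored.extend(zip(smiles, scores))
        generator.update(smiles, scores)         # reinforce high-scoring molecules
    return max(scored, key=lambda t: t[1])
```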
Approaches:
Augmented memory: combines data augmentation with experience replay
Saturn: sample-efficient de novo design
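A sketch of the data side of augmented memory, assuming RDKit's randomized SMILES output for the augmentation part; the policy update itself is omitted:

```python
# Sketch: keep a replay buffer of the top-scoring molecules and enlarge each
# replayed SMILES with randomized (non-canonical) SMILES before reuse.
from rdkit import Chem

def augment(smiles: str, n: int = 4) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    return [Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)]

class ReplayBuffer:
    def __init__(self, capacity: int = 100):
        self.capacity = capacity
        self.items: list[tuple[float, str]] = []   # (score, smiles)

    def add(self, smiles: str, score: float):
        self.items.append((score, smiles))
        self.items = sorted(self.items, reverse=True)[: self.capacity]

    def replay(self, k: int = 10) -> list[tuple[str, float]]:
        out = []
        for score, smi in self.items[:k]:
            for aug in augment(smi):               # augmentation multiplies reuse
                out.append((aug, score))
        return out
```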
Synthesizability-constrained generation
TANGO: enforcing building blocks in synthesis routes
New reward function
Tanimoto similarity
Substructure match
Accelerates search for high-value molecules
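A sketch of a building-block reward combining Tanimoto similarity on Morgan fingerprints with a substructure match, in the spirit of the reward described above; the exact weighting is an illustrative assumption, not the published TANGO reward:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def building_block_reward(candidate_smiles: str, block_smiles: str) -> float:
    cand = Chem.MolFromSmiles(candidate_smiles)
    block = Chem.MolFromSmiles(block_smiles)
    if cand is None or block is None:
        return 0.0
    fp_c = AllChem.GetMorganFingerprintAsBitVect(cand, 2, nBits=2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(block, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(fp_c, fp_b)
    has_block = cand.HasSubstructMatch(block)   # building block fully contained
    return 1.0 if has_block else sim            # full reward once the block appears
```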
FSscore: Chemist’s personalized feasibility score
Different chemists find it easier to synthesize different molecules
Can fine-tune model to align with chemist preferences
Can replace human expert with a make-on-demand molecule library
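A sketch of aligning a feasibility score with pairwise chemist preferences via a Bradley-Terry style ranking loss (in the spirit of FSscore fine-tuning, not its actual code); `score_model` and the placeholder fingerprint features are hypothetical:

```python
# Sketch: push the score of the molecule judged easier to synthesize above
# the harder one. Labels can come from a chemist or, as noted above, from
# membership in a make-on-demand library.
import torch

def preference_loss(score_model: torch.nn.Module,
                    easier: torch.Tensor, harder: torch.Tensor) -> torch.Tensor:
    s_easy = score_model(easier).squeeze(-1)
    s_hard = score_model(harder).squeeze(-1)
    return -torch.nn.functional.logsigmoid(s_easy - s_hard).mean()

# Toy usage with placeholder fingerprint-sized features
model = torch.nn.Sequential(torch.nn.Linear(2048, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
easier, harder = torch.rand(32, 2048), torch.rand(32, 2048)
loss = preference_loss(model, easier, harder)
loss.backward()
opt.step()
```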