The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models

Slides (pptx, pdf)

Abstract:

Machine learning (ML) models hold the promise of transforming atomic simulations by delivering quantum chemical accuracy at a fraction of the computational cost. Realization of this potential would enable high-throughput, high-accuracy molecular screening campaigns to explore vast regions of chemical space and facilitate ab initio-level simulations at sizes and time scales that were previously inaccessible. However, a fundamental challenge to creating ML models that perform well across molecular chemistry is the lack of comprehensive data for training. Despite substantial efforts in data generation, no large-scale molecular dataset exists that combines broad chemical diversity with a high level of accuracy. To address this gap, we introduce Open Molecules 2025 (OMol25), a large-scale dataset composed of more than 100 million density functional theory (DFT) calculations at the ωB97M-V/def2-TZVPD level of theory, representing billions of CPU core-hours of compute. OMol25 uniquely blends elemental, chemical, and structural diversity including: 83 elements, a wide range of intra- and intermolecular interactions, explicit solvation, variable charge/spin, conformers, and reactive structures. There are ~83M unique molecular systems in OMol25 covering small molecules, biomolecules, metal complexes, and electrolytes, including structures obtained from existing datasets. OMol25 also greatly expands on the size of systems typically included in DFT datasets, with systems of up to 350 atoms. In addition to the public release of the data, we provide baseline models and a comprehensive set of model evaluations to encourage community engagement in developing the next-generation ML models for molecular chemistry.

Bio:

Dr. Samuel M. Blau is a Research Scientist at Berkeley Lab working at the intersection of computational chemistry, materials science, high-performance computing, and machine learning. He received his B.S. in 2012 from Haverford College and his Ph.D. in Chemical Physics from Harvard University in 2017. Sam has pioneered the use of self-correcting molecular simulation workflows to enable the construction of chemical reaction networks describing complex reaction cascades, e.g. those responsible for battery interphase formation and photoresist patterning. Sam's research group also develops novel datasets, representations, and models for machine learning of chemistry and materials as well as methods that leverage ML model speed and differentiability for accelerated scientific discovery.

Summary:

Focus: modeling molecular dynamics for chemistry
Options:
- Molecular dynamics: atom-atom interactions using simple potential functions
  - Relatively fast but limited accuracy
- Density functional theory (DFT): models interactions among nuclei and electrons
  - Much more computationally intensive and slower, but far more accurate
- Machine-learned inter-atomic potentials: trained on DFT calculations
  - Near-DFT accuracy
  - Near-MD speed
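As a rough illustration of the "simple potential functions" used in classical molecular dynamics, the sketch below evaluates a pairwise potential and the forces derived from it. The Lennard-Jones form and the parameter values (`epsilon`, `sigma`, in arbitrary units) are assumptions chosen for illustration; the talk did not name a specific force field.

```python
import math


def lennard_jones(positions, epsilon=1.0, sigma=1.0):
    """Return (total energy, per-atom forces) for a Lennard-Jones system.

    positions: list of [x, y, z] coordinates.
    epsilon, sigma: illustrative well depth and length scale (arbitrary units).
    """
    n = len(positions)
    energy = 0.0
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Vector from atom j to atom i and its length
            rij = [a - b for a, b in zip(positions[i], positions[j])]
            r = math.sqrt(sum(c * c for c in rij))
            sr6 = (sigma / r) ** 6
            # Pair energy: 4*eps*[(sigma/r)^12 - (sigma/r)^6]
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
            # Force on atom i is -dE/dr along rij/r; equal and opposite on atom j
            f_over_r = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / (r * r)
            for k in range(3):
                forces[i][k] += f_over_r * rij[k]
                forces[j][k] -= f_over_r * rij[k]
    return energy, forces


# Two atoms at the potential minimum, r = 2^(1/6) * sigma
energy, forces = lennard_jones([[0.0, 0.0, 0.0], [2 ** (1 / 6), 0.0, 0.0]])
print(energy, forces)
```

Each pair costs a handful of arithmetic operations, which is why this class of model is fast; it has no explicit electrons, which is one reason its accuracy is limited compared with DFT.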
Substantial prior work on training ML inter-atomic potentials:
- Matbench Discovery: https://matbench-discovery.materialsproject.org/
  - MPtrj: dataset of DFT relaxation trajectories from the Materials Project
  - MACE-MP-0: trained on MPtrj but applicable to a much wider range of chemical systems
- OMat24 dataset: much larger, produces much more accurate ML models
  - Meta FAIR chemistry team datasets:
    - Catalysis
    - MOFs
    - Materials
    - ~400M core-hours each: far more compute than any academic or government team can afford
- Still missing: molecular chemistry
OMol25
- https://www.faccts.de/omol25-dataset
- https://arxiv.org/abs/2505.08762
- https://huggingface.co/facebook/OMol25
- Collaboration
  - Industry: Meta, Genentech
  - Government: LBNL, LANL
  - Academia
- 6B core-hours
- Small molecules, electrolytes, metal complexes, biomolecules
  - Up to 350 atoms
  - 83 elements
  - Charge -10 to 10
  - Spin multiplicity 1 to 11
- Construction
  - Computed with the ORCA quantum chemistry package: https://www.faccts.de/orca/
    - Emphasis on accurate DFT calculations
  - 24%: Structures from several prior datasets, recomputed either directly or after perturbing the initial conditions or molecular structures to sample the space
    - ANI-2x
    - OrbNet Denali
    - Transition1x
    - GEOM
    - RGD1
    - ANI-1xBB
    - MechDBs
    - Solvated Protein Fragments
    - SPICE2
  - 21%: Biomolecules
    - Protein-nucleic acid
    - Protein-ligand pockets, fragments
    - Protein-protein interface, core
    - ML-MD proteins
  - 20%: Metal complexes
    - Reactivity
    - ML-MD metal complexes
    - Architector high-spin, spin-inert, low-spin
  - 35%: Electrolytes
    - Interface clusters
    - Scaled clusters
    - 5 Å clusters
    - Redoxed clusters
    - Random solvates
    - Reactivity
    - 3 Å clusters
    - RPMD clusters
    - ML-MD electrolytes
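The charge and spin multiplicity ranges listed above are not independent: for a given composition, the electron count fixes which multiplicities are possible. The sketch below shows that parity rule under the usual convention that multiplicity is 2S + 1. It is only a necessary condition, the function names are hypothetical, and the notes do not describe how OMol25 itself validates its inputs.

```python
def n_electrons(atomic_numbers, charge):
    """Total electron count: sum of nuclear charges minus the net charge."""
    return sum(atomic_numbers) - charge


def is_consistent(atomic_numbers, charge, multiplicity):
    """Necessary (not sufficient) check that a charge/multiplicity pair is possible.

    multiplicity = 2S + 1, so the number of unpaired electrons is multiplicity - 1.
    The remaining electrons must pair up, so their count must be even.
    """
    electrons = n_electrons(atomic_numbers, charge)
    unpaired = multiplicity - 1
    if electrons < 0 or unpaired < 0:
        return False
    return unpaired <= electrons and (electrons - unpaired) % 2 == 0


def allowed_multiplicities(atomic_numbers, charge, max_multiplicity=11):
    """List multiplicities from 1 to max_multiplicity that pass the parity check."""
    return [m for m in range(1, max_multiplicity + 1)
            if is_consistent(atomic_numbers, charge, m)]


water = [8, 1, 1]  # atomic numbers of O, H, H
print(allowed_multiplicities(water, 0))  # even electron count: odd multiplicities
print(allowed_multiplicities([26], 3))   # Fe with charge +3: even multiplicities
```

The default `max_multiplicity=11` mirrors the upper bound quoted in the notes. Whether a multiplicity that passes this check is chemically reasonable is a separate question that the check does not answer.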
UMA: Universal Model for Atoms: https://ai.meta.com/research/publications/uma-a-family-of-universal-models-for-atoms/
- Trained on all the data that the FAIR chemistry team has released
- Different datasets have different tasks
  - Single-task models are more accurate than a multi-task model
  - But adding a merged mixture of linear experts lets the multi-task model outperform the single-task models
- Because the linear experts can be merged, the model used for inference is much smaller than the model that was trained, so it is cheap to apply
- Evaluation of UMA model trained on OMol25
  - Separation of Train, Eval, Test sub-datasets to evaluate out-of-distribution performance
  - Metrics: energy and force errors
    - Excellent performance for Neutral Organics
    - Good for biomolecules and electrolytes
    - Metal complexes are more challenging and less accurate
    - Many more results...
  - Novel evaluation metrics/tasks
    - Ligand-pocket interaction energy: E(ligand + pocket) - E(ligand) - E(pocket), which isolates the interaction between ligand and pocket
    - Ligand strain and conformers
    - Protonation energies: change in energy and geometry as a proton is added or removed
    - IE/EA/spin gap: effect of removing an electron (ionization energy), adding an electron (electron affinity), or changing the spin state
    - Distance scaling: tests how well inter-molecular interactions are resolved as distances change (macro-scale drivers of molecular properties)
    - Excellent accuracy for ligand strain and conformers
    - Good for protonation
    - More work needed for: protein-ligand, IE/EA, Spin gap, Distance scaling
- Rowan benchmarks: https://benchmarks.rowansci.com/
  - UMA is much more accurate than prior models
  - For many chemical systems there is no point in running DFT
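The mixture-of-linear-experts point in the UMA notes can be illustrated with a small sketch. Because each expert is a linear map, blending the expert outputs gives the same result as applying one pre-merged weight matrix, so inference needs only a single matrix. The code assumes the mixing coefficients are fixed for a given system, which is what allows the merge to be done once up front. The matrices, coefficients, and function names are made up for illustration and are not UMA's actual layers.

```python
def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mole_forward(experts, coeffs, x):
    """Evaluate every expert and blend the outputs (cost grows with expert count)."""
    out = [0.0] * len(experts[0])
    for a, w in zip(coeffs, experts):
        y = matvec(w, x)
        out = [o + a * yi for o, yi in zip(out, y)]
    return out


def merge_experts(experts, coeffs):
    """Pre-combine the expert weights into one matrix of the same shape."""
    rows, cols = len(experts[0]), len(experts[0][0])
    merged = [[0.0] * cols for _ in range(rows)]
    for a, w in zip(coeffs, experts):
        for i in range(rows):
            for j in range(cols):
                merged[i][j] += a * w[i][j]
    return merged


# Three hypothetical 2x2 experts and fixed mixing coefficients
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
    [[1.0, 1.0], [-1.0, 1.0]],
]
coeffs = [0.5, 0.25, 0.25]
x = [3.0, -1.0]

blended = mole_forward(experts, coeffs, x)  # uses all three experts
merged = merge_experts(experts, coeffs)     # done once per system
fast = matvec(merged, x)                    # single matrix at inference
print(blended, fast)
```

After merging, the cost of a forward pass no longer depends on the number of experts, which matches the observation that the deployed model is much smaller than the trained one.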
Model accuracy trends: OMol25 is large enough that accuracy keeps improving as models train longer
Coming up:
- Extending dataset:
  - d-block intermediate spins
  - More diverse heavy main-group elements and noble gases
  - Molecular crystals
  - Polymers
- More information, test set, evaluation tasks
- Public leaderboard