Abstract:
Machine learning (ML) models hold the promise of transforming atomic simulations by delivering quantum chemical accuracy at a fraction of the computational cost. Realization of this potential would enable high-throughput, high-accuracy molecular screening campaigns to explore vast regions of chemical space and facilitate ab initio-level simulations at sizes and time scales that were previously inaccessible. However, a fundamental challenge to creating ML models that perform well across molecular chemistry is the lack of comprehensive data for training. Despite substantial efforts in data generation, no large-scale molecular dataset exists that combines broad chemical diversity with a high level of accuracy. To address this gap, we introduce Open Molecules 2025 (OMol25), a large-scale dataset composed of more than 100 million density functional theory (DFT) calculations at the ωB97M-V/def2-TZVPD level of theory, representing billions of CPU core-hours of compute. OMol25 uniquely blends elemental, chemical, and structural diversity including: 83 elements, a wide range of intra- and intermolecular interactions, explicit solvation, variable charge/spin, conformers, and reactive structures. There are ~83M unique molecular systems in OMol25 covering small molecules, biomolecules, metal complexes, and electrolytes, including structures obtained from existing datasets. OMol25 also greatly expands on the size of systems typically included in DFT datasets, with systems of up to 350 atoms. In addition to the public release of the data, we provide baseline models and a comprehensive set of model evaluations to encourage community engagement in developing the next-generation ML models for molecular chemistry.
Bio:
Dr. Samuel M. Blau is a Research Scientist at Berkeley Lab working at the intersection of computational chemistry, materials science, high-performance computing, and machine learning. He received his B.S. in 2012 from Haverford College and his Ph.D. in Chemical Physics from Harvard University in 2017. Sam has pioneered the use of self-correcting molecular simulation workflows to enable the construction of chemical reaction networks describing complex reaction cascades, e.g. those responsible for battery interphase formation and photoresist patterning. Sam's research group also develops novel datasets, representations, and models for machine learning of chemistry and materials as well as methods that leverage ML model speed and differentiability for accelerated scientific discovery.
Summary:
Focus: modeling molecular dynamics for chemistry
Options:
Molecular dynamics: atom-atom interactions using simple potential functions
Relatively fast but limited accuracy
Density Functional Theory: nuclei, electron interactions
Much more computationally intensive, slower, but a lot more accurate
Machine-learned inter-atomic potentials: trained on DFT calculations
Near-DFT accuracy
Near-MD speed
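To make the "simple potential functions" option concrete, here is a minimal sketch of a classical pair potential. The Lennard-Jones form and the reduced units (epsilon = sigma = 1) are illustrative assumptions; the talk did not name a specific force field.

```python
import math

def lennard_jones_energy(positions, epsilon=1.0, sigma=1.0):
    """Total pairwise Lennard-Jones energy for a list of (x, y, z) positions.

    epsilon (well depth) and sigma (zero-crossing distance) are illustrative
    defaults in reduced units, not parameters taken from the talk.
    """
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            sr6 = (sigma / r) ** 6
            # 4 * eps * [(sigma/r)^12 - (sigma/r)^6]
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy

# Two atoms at the potential minimum, r = 2^(1/6) * sigma.
r_min = 2.0 ** (1.0 / 6.0)
dimer_energy = lennard_jones_energy([(0.0, 0.0, 0.0), (r_min, 0.0, 0.0)])
```

The cost is a few arithmetic operations per atom pair, which is why this class of model is fast. It has no explicit electrons, which is where the accuracy limit comes from.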
Lots of work on training ML inter-atomic potentials
Matbench Discovery: https://matbench-discovery.materialsproject.org/
MPtrj: dataset of DFT relaxation trajectories from the Materials Project
MACE-MP-0 trained on MPtrj but applicable to a much wider range of chemicals
OMat24 dataset: much larger, produces much more accurate ML models
Meta FAIR chemistry team datasets:
Catalysis
MOFs
Materials
Each ~400M core-hours: much more compute than any academic or government team can afford
Still missing: molecular chemistry
OMol25
Collaboration
Industry: Meta, Genentech
Government: LBNL, LANL
Academia
6B core-hours
Small molecules, electrolytes, metal complexes, biomolecules
Up to 350 atoms
83 elements
Charge -10 to 10
Spin multiplicity 1 to 11
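Charge and spin multiplicity cannot be chosen independently: the number of unpaired electrons (multiplicity minus one) must have the same parity as the total electron count. A minimal sketch of that check follows. The function name and the water example are illustrative assumptions, not part of the OMol25 tooling.

```python
def is_consistent(atomic_numbers, charge, multiplicity):
    """Return True if the charge and spin multiplicity are compatible.

    Hypothetical helper for illustration only.
    """
    n_electrons = sum(atomic_numbers) - charge
    n_unpaired = multiplicity - 1  # multiplicity = 2S + 1
    if n_electrons < 0 or n_unpaired < 0 or n_unpaired > n_electrons:
        return False
    # The remaining electrons must be able to pair up.
    return (n_electrons - n_unpaired) % 2 == 0

water = [8, 1, 1]  # O, H, H: 10 electrons when neutral
neutral_singlet_ok = is_consistent(water, charge=0, multiplicity=1)
neutral_doublet_ok = is_consistent(water, charge=0, multiplicity=2)
cation_doublet_ok = is_consistent(water, charge=1, multiplicity=2)
```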
Construction
ORCA Computational Chemistry simulation: https://www.faccts.de/orca/
Emphasis on accurate computation of the DFT
24%: Prior datasets, recomputed either directly or after perturbing the initial conditions or molecular structures to sample more of the space
ANI-2x
OrbNet Denali
Transition1x
GEOM
RGD1
ANI-1xBB
MechDBs
Solvated Protein Fragments
SPICE2
21%: Biomolecules
Protein-nucleic acid
Protein-ligand pockets, fragments
Protein-protein interface, core
ML-MD proteins
20%: Metal complexes
Reactivity
ML-MD metal complexes
Architector high-spin, spin-inert, low-spin
35%: Electrolytes
Interface clusters
Scaled clusters
5 Å clusters
Redoxed clusters
Random solvates
Reactivity
3 Å clusters
RPMD clusters
ML-MD electrolytes
UMA: Universal Model for Atoms: https://ai.meta.com/research/publications/uma-a-family-of-universal-models-for-atoms/
Trained on all the data that FAIR chemistry team has put out
Different datasets have different tasks
Single-task models are more accurate than a multi-task model
But adding a merged mixture of linear experts lets the multi-task model outperform the single-task models
Because the linear experts are merged into a single set of weights, the model used for inference is much smaller than the model originally trained, so it is cheap to apply
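Why merging linear experts shrinks the inference model can be seen in a small sketch. It assumes the mixing coefficients are fixed for a given system (for example, set by system-level inputs rather than per-atom features), so the weighted sum of expert matrices can be formed once up front. The function names and toy matrices are hypothetical and are not the UMA implementation.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mixture_output(experts, coeffs, x):
    """Evaluate every expert, then blend: sum_k a_k * (W_k x)."""
    out = [0.0] * len(experts[0])
    for a, W in zip(coeffs, experts):
        y = matvec(W, x)
        out = [o + a * yi for o, yi in zip(out, y)]
    return out

def merge_experts(experts, coeffs):
    """Pre-combine the experts into one matrix: sum_k a_k * W_k."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [
        [sum(a * W[r][c] for a, W in zip(coeffs, experts)) for c in range(cols)]
        for r in range(rows)
    ]

# Toy example: two 2x2 experts with assumed, fixed mixing coefficients.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
coeffs = [0.25, 0.75]
x = [1.0, 2.0]

blended = mixture_output(experts, coeffs, x)      # runs every expert
merged = matvec(merge_experts(experts, coeffs), x)  # runs one merged matrix
```

Both paths give the same output because the operation is linear, but the merged path stores and applies a single matrix regardless of how many experts were trained.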
Evaluation of UMA model trained on OMol25
Separation of Train, Eval, Test sub-datasets to evaluate out-of-distribution performance
Metrics: Energy and Force error:
Excellent performance for Neutral Organics
Good for biomolecules and electrolytes
Metal complexes are more challenging and less accurate
Many more results...
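The energy and force error metrics can be sketched as mean absolute errors against the DFT reference. Averaging forces over individual Cartesian components is an assumption here; the exact averaging convention used in the talk's results was not stated.

```python
def energy_mae(predicted, reference):
    """Mean absolute error over per-structure energies."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

def force_mae(predicted, reference):
    """Mean absolute error over all Cartesian force components.

    predicted, reference: lists of per-atom (fx, fy, fz) tuples.
    Component-wise averaging is an assumed convention.
    """
    diffs = [
        abs(pc - rc)
        for p_atom, r_atom in zip(predicted, reference)
        for pc, rc in zip(p_atom, r_atom)
    ]
    return sum(diffs) / len(diffs)

# Made-up numbers, for illustration only.
e_err = energy_mae([1.0, 2.0], [1.5, 1.0])
f_err = force_mae([(0.0, 0.0, 1.0)], [(0.0, 0.0, 0.0)])
```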
Novel evaluation metrics/tasks
Ligand-pocket interaction energy: energy in ligand+pocket - energy in just ligand - energy in just pocket (captures ligand-pocket interaction)
Ligand strain and conformers
Protonation energies: change in energy and geometry as you add/remove a proton
IE/EA/spin gap: change in energy when you add or remove an electron (ionization energy, electron affinity) or change the spin state
Distance scaling: resolves inter-molecular interactions as you change distances (macro-scale drivers of molecular properties)
Excellent accuracy for ligand strain and conformers
Good for protonation
More work needed for: protein-ligand, IE/EA, Spin gap, Distance scaling
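Several of the evaluation tasks above reduce to differences between total energies, which can be sketched as small helpers. The sign conventions (IE as cation minus neutral, EA as neutral minus anion, spin gap as high-spin minus low-spin) are assumed common conventions, not definitions taken from the talk, and the function names are hypothetical.

```python
def interaction_energy(e_complex, e_ligand, e_pocket):
    """E(ligand + pocket) - E(ligand) - E(pocket), as described in the notes."""
    return e_complex - e_ligand - e_pocket

def ionization_energy(e_neutral, e_cation):
    """Energy needed to remove one electron (assumed sign convention)."""
    return e_cation - e_neutral

def electron_affinity(e_neutral, e_anion):
    """Energy released on adding one electron (assumed sign convention)."""
    return e_neutral - e_anion

def spin_gap(e_high_spin, e_low_spin):
    """Energy difference between two spin states (assumed sign convention)."""
    return e_high_spin - e_low_spin

# Made-up energies in arbitrary units, for illustration only.
e_int = interaction_energy(e_complex=-10.5, e_ligand=-4.0, e_pocket=-6.0)
```

A negative interaction energy under this convention means the ligand and pocket are more stable together than apart. Because each quantity is a small difference of large totals, errors that cancel poorly between the terms show up directly in the result.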
Rowan benchmarks: https://benchmarks.rowansci.com/
UMA is much more accurate than prior models
For many chemical systems, the model is accurate enough that there is no point in running DFT
Model accuracy trends: OMol25 is sufficiently large that accuracy keeps improving as models train for longer
Coming up:
Extending dataset:
d-block intermediate spins
More diverse heavy main-group elements, noble gases
Molecular crystals
Polymers
More information, test set, evaluation tasks
Public leaderboard