In progress, please contribute. Add your name if you contribute!
Contributor: Hao Zhou (zhouhao@4paradigm.com); Wei-Wei Tu (tuweiwei@4paradigm.com)
Benchmark Datasets
Basic information:
created in 2021
Comments:
It was developed by the Scientific Machine Learning Research Group at the Rutherford Appleton Laboratory in the UK and is one of the most recent and versatile scientific ML benchmarking initiatives.
The benchmark covers scientific problems drawn from materials science and environmental science, including Diffuse Multiple Scattering (DMS_Structure, materials science), Cloud Masking (SLSTR_Cloud, environmental science), and Electron Microscopy Image Denoising (EM_Denoise).
DMS_Structure uses machine learning to classify the structure of multi-phase materials from X-ray scattering patterns, a multi-label classification problem. SLSTR_Cloud classifies each pixel of a set of satellite images as either cloud or non-cloud (clear sky), a binary classification problem; a minimal sketch of such a pixel-wise task follows this entry. EM_Denoise uses machine learning to remove noise from electron microscopy images.
More datasets from other scientific domains still need to be added.
Quantitative numbers:
em_graphene_sim: 28GB
dms_sim: 7GB
slstr_cloud_ds1: 180GB
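To make the SLSTR_Cloud-style task concrete, the sketch below trains a tiny fully convolutional network for per-pixel cloud/no-cloud classification in PyTorch. The channel count, image size, and synthetic tensors are assumptions for illustration only; this is not the benchmark's official data loader or reference model.

# Illustrative only: a minimal pixel-wise binary classifier for a
# cloud-masking-style task such as SLSTR_Cloud. Shapes and data are synthetic.
import torch
import torch.nn as nn

class TinySegmenter(nn.Module):
    """Small fully convolutional net producing one cloud/no-cloud logit per pixel."""
    def __init__(self, in_channels: int = 9):  # 9 input channels assumed for illustration
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=1),  # per-pixel logit
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Synthetic stand-in batch: 4 images, 9 channels, 128x128 pixels, plus binary masks.
images = torch.randn(4, 9, 128, 128)
masks = torch.randint(0, 2, (4, 1, 128, 128)).float()

model = TinySegmenter()
criterion = nn.BCEWithLogitsLoss()          # per-pixel binary classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(3):                        # a few dummy optimization steps
    optimizer.zero_grad()
    loss = criterion(model(images), masks)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")

A real run would replace the synthetic tensors with the slstr_cloud_ds1 imagery and its reference cloud masks.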
Basic information:
created in 2017
Comments:
MoleculeNet is one of the most famous benchmark datasets for molecular machine learning with over 3.2K GitHub stars. The full collection currently includes over 700,000 compounds tested on a range of different properties. These properties can be subdivided into four categories: quantum mechanics, physical chemistry, biophysics and physiology.
MoleculeNet addresses the lack of a standard benchmark for comparing the efficacy of proposed molecular machine learning methods.
Most of its datasets contain only a few thousand samples, and the smallest has only about 600.
Quantitative numbers:
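As a usage illustration, the sketch below loads one MoleculeNet collection (Tox21, from the physiology category) through DeepChem's MoleculeNet loaders and fits a simple scikit-learn baseline. The ECFP featurizer choice and the random-forest baseline are assumptions for illustration, not a recommended model.

# A minimal sketch, assuming DeepChem and scikit-learn are installed.
import deepchem as dc
from sklearn.ensemble import RandomForestClassifier

# Load Tox21 with ECFP fingerprints; DeepChem returns the task names,
# the (train, valid, test) splits, and the transformers it applied.
tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="ECFP")
print(f"{len(tasks)} tasks, {train.X.shape[0]} training molecules")

# Baseline: predict the first toxicity task from the fingerprints.
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train.X, train.y[:, 0])
print("validation accuracy:", clf.score(valid.X, valid.y[:, 0]))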
Basic information:
PDEBench: An Extensive Benchmark for Scientific Machine Learning
created in 2022
Comments:
PDEBench covers a much wider range of PDEs than existing benchmarks, from relatively common examples to more realistic and difficult problems.
It alleviates the lack of benchmarks for learning physical dynamics.
In PDEBench, each sample is generated with different PDE parameters, initial conditions, and boundary conditions, and the difficulty of a problem is heavily influenced by those parameters.
Parameterized multi-scale and multi-physics problems still need to be added.
Quantitative numbers:
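PDEBench distributes its simulation data as HDF5 files, with many samples (one per set of initial conditions) stacked in each file. Since the internal key layout differs between PDEs, the hedged sketch below simply walks one file with h5py and reports dataset shapes; the file name is a hypothetical placeholder.

# A minimal sketch for inspecting one PDEBench-style HDF5 file with h5py.
import h5py

path = "1D_Advection_Sol_beta0.1.hdf5"  # hypothetical example file name

with h5py.File(path, "r") as f:
    def report(name, obj):
        if isinstance(obj, h5py.Dataset):
            # Samples generated with different initial conditions are typically
            # stacked along the first axis of each dataset.
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    f.visititems(report)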
Basic information:
Comments:
The Protein Data Bank (PDB) is the first open-access molecular data resource in biology; it has been managed by the Worldwide Protein Data Bank since 2003.
The PDB Core Archive houses 3D atomic coordinates of more than 144,000 structural models of proteins, DNA/RNA, and their complexes with metals and small molecules, together with related experimental data and metadata.
PDB is universally regarded as a core data resource essential for understanding the functional roles that macromolecules play in biology and medicine.
Quantitative numbers:
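As a minimal sketch of working with the archive programmatically, the snippet below fetches and parses one example entry with Biopython; the entry ID and the choice of the mmCIF format are illustrative assumptions.

# A minimal sketch, assuming Biopython is installed.
from Bio.PDB import PDBList, MMCIFParser

pdb_id = "1tup"                       # example entry: a p53/DNA complex
pdbl = PDBList()
# Download the mmCIF file for this entry into the current directory.
path = pdbl.retrieve_pdb_file(pdb_id, pdir=".", file_format="mmCif")

parser = MMCIFParser(QUIET=True)
structure = parser.get_structure(pdb_id, path)

# Count chains and atoms across all models in the entry.
n_chains = sum(1 for _ in structure.get_chains())
n_atoms = sum(1 for _ in structure.get_atoms())
print(f"{pdb_id}: {n_chains} chains, {n_atoms} atoms")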
Basic information:
created in 2020
Comments:
Matbench contains 13 supervised ML tasks from 10 datasets.
Matbench’s data are sourced from various subdisciplines of materials science, such as experimental mechanical properties (alloy strength), computed elastic properties, computed and experimental electronic properties, optical and phonon properties, and thermodynamic stabilities for crystals, 2D materials, and disordered metals.
The number of samples in each task ranges from 312 to 132,752, representing both relatively scarce experimental materials properties and comparatively abundant properties such as DFT-GGA formation energies.
Quantitative numbers:
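The sketch below follows the matbench package's documented fold-based evaluation loop on a single task (matbench_dielectric); the constant "predict the training median" model is only there to show the record/score mechanics and is not a real baseline.

# A minimal sketch, assuming the matbench package is installed.
import numpy as np
from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark(autoload=False, subset=["matbench_dielectric"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        test_inputs = task.get_test_data(fold, include_target=False)
        # Trivial constant prediction, standing in for a real structure-based model.
        predictions = np.full(len(test_inputs), np.median(train_outputs))
        task.record(fold, predictions)

print(mb.scores)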
Basic information:
Comments:
WeatherBench is a benchmark dataset for data-driven medium-range weather forecasting, a topic of high scientific interest for atmospheric and computer scientists alike.
It addresses the lack of a common dataset and common evaluation metrics, which has made inter-comparison between weather-forecasting studies difficult.
The data are derived from the ERA5 archive and have been processed to facilitate their use in machine learning models.
Quantitative numbers:
It contains 14 variables and 5 constant fields, each provided as a gridded, image-like field.
It provides the data at 5.625° (32x64 grid points), 2.8125° (64x128 grid points), and 1.40625° (128x256 grid points) resolution.
The entire dataset at 5.625° resolution has a size of 191GB. Individual variables amount to around 25GB for three-dimensional fields and around 2GB for two-dimensional fields. File sizes for the 2.8125° and 1.40625° resolutions are roughly 4 and 16 times larger, respectively.
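As a minimal loading sketch, the snippet below opens one WeatherBench variable (500 hPa geopotential at 5.625° resolution) with xarray and computes a simple climatology baseline; the local directory name and the training-period years are assumptions for illustration.

# A minimal sketch, assuming the NetCDF files for one variable have been
# downloaded into a local directory (the path below is a placeholder).
import xarray as xr

z500 = xr.open_mfdataset("geopotential_500_5.625deg/*.nc", combine="by_coords")
print(z500)  # expected dimensions: time, lat (32), lon (64)

# Example baseline: a climatology averaged over an assumed training period.
climatology = z500.sel(time=slice("1979", "2016")).mean("time")
print(climatology)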