Dataset Description

Datasets Used in This Study

This project uses two complementary datasets from the LiTraj benchmark suite. They differ in accuracy and scale, which makes them a natural fit for transfer learning: pretrain on the large approximate dataset, then fine-tune on the small high-accuracy one.

1. nebBVSE122k Dataset (Approximate Physics)

Size: 122,421 lithium vacancy migration hops
Method: BVSE-NEB (Bond Valence Site Energy + Nudged Elastic Band)

This dataset contains migration barriers computed with the bond valence site energy (BVSE) method, an approximate empirical model. BVSE is less accurate than DFT, but it is cheap enough to estimate barriers at large scale.
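As a reminder of what a NEB "migration barrier" is, here is a minimal sketch. It assumes the usual convention that the forward barrier is the highest energy along the path minus the energy of the initial state. The function name and the energies are hypothetical and not taken from the dataset, whose exact barrier definition should be checked in the LiTraj documentation.

```python
def migration_barrier(image_energies):
    """Forward barrier: highest image energy minus the initial-state energy.

    image_energies: energies (eV) of the NEB images, ordered from the
    initial site to the final site of the hop.
    """
    if len(image_energies) < 2:
        raise ValueError("need at least the initial and final images")
    return max(image_energies) - image_energies[0]


# Hypothetical energies (eV) for a five-image path, for illustration only.
example_energies = [0.0, 0.12, 0.35, 0.20, 0.05]
barrier = migration_barrier(example_energies)
```

The same definition applies to BVSE-NEB and DFT-NEB paths; only the energy model behind the image energies changes.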

Key characteristics:

Each data point corresponds to a single lithium vacancy hop.
Structures are provided as supercells.
A special centroid atom (“X”) marks the diffusion bottleneck.
The dataset includes predefined train/validation/test splits.

Role in this project:

This dataset is used for large-scale pretraining, so that the model learns structural representations relevant to lithium migration before it sees any DFT data.

2. nebDFT2k Dataset (High-Fidelity Ground Truth)

Size: 1,681 lithium vacancy migration hops
Method: DFT-NEB (Density Functional Theory + Nudged Elastic Band)

This dataset contains high-accuracy migration barriers computed with density functional theory (DFT).

Key characteristics:

Migration paths are fully relaxed using DFT-NEB.
Forces and energies are computed from first principles.
Structures include centroid representations for GNN input.

Role in this project:

This dataset serves as the high-fidelity benchmark. It is used for:

Fine-tuning pretrained models
Training a scratch baseline
Final evaluation
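The three uses above can be sketched as a toy workflow. This is an illustration only: a one-variable linear model stands in for the GNN, the data are synthetic, and every name and number below is an assumption rather than part of the project's code or results.

```python
import random


def make_data(n, slope, intercept, noise, rng):
    """Synthetic (x, y) pairs standing in for (structure features, barrier)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        data.append((x, slope * x + intercept + rng.gauss(0.0, noise)))
    return data


def fit(data, w=0.0, b=0.0, lr=0.05, epochs=20):
    """Plain SGD on squared error for y ~ w*x + b, starting from (w, b)."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def mae(data, w, b):
    """Mean absolute error of the linear model on a dataset."""
    return sum(abs((w * x + b) - y) for x, y in data) / len(data)


rng = random.Random(0)
# Large, approximate dataset (plays the role of nebBVSE122k).
low_fidelity = make_data(2000, slope=1.8, intercept=0.3, noise=0.05, rng=rng)
# Small, accurate dataset (plays the role of nebDFT2k), split into train/test.
high_fidelity_train = make_data(20, slope=2.0, intercept=0.2, noise=0.02, rng=rng)
high_fidelity_test = make_data(200, slope=2.0, intercept=0.2, noise=0.02, rng=rng)

# 1) Pretrain on the approximate data.
w_pre, b_pre = fit(low_fidelity, epochs=5)
# 2) Fine-tune the pretrained weights on the accurate training set.
w_ft, b_ft = fit(high_fidelity_train, w_pre, b_pre)
# 3) Scratch baseline: same training set, no pretraining.
w_sc, b_sc = fit(high_fidelity_train)

# Both models are scored on the same held-out test set.
mae_finetuned = mae(high_fidelity_test, w_ft, b_ft)
mae_scratch = mae(high_fidelity_test, w_sc, b_sc)
```

The point of the sketch is the structure of the comparison: both models see the same high-fidelity training data and the same test data, and differ only in their starting weights.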

Data Split Strategy

To ensure fair evaluation:

DFT data was split into 80% training and 20% test.
The test set remained untouched during training.
Both scratch and fine-tuned models were evaluated on the same test set.

This ensures a clean comparison between training strategies.
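A minimal sketch of a reproducible 80/20 split follows. It assumes a random shuffle over hop indices with a fixed seed; the page does not state how the split was actually drawn, and the function name, seed, and rounding rule are illustrative choices.

```python
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    test = sorted(indices[:n_test])
    train = sorted(indices[n_test:])
    return train, test


# 1,681 is the nebDFT2k size quoted above.
train_idx, test_idx = split_indices(1681)
```

With this rounding rule, 1,681 hops give 1,345 training and 336 test samples. Fixing the seed lets the scratch and fine-tuned models share exactly the same test indices across runs.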
