The dataset used in this competition is derived from ogbg-ppa, the publicly available Protein-Protein Association (PPA) dataset. The original dataset has 37 classes and is widely used for graph classification tasks.
For this competition:
Subset Selection:
- 6 of the 37 available classes were chosen at random.
- 40% of the original dataset was randomly selected.
Noise Addition:
- Different levels of symmetric and asymmetric noise were added to the labels.
- Four distinct datasets were created. They differ from one another in:
  - the percentage of noisy labels, or
  - the type of noise (symmetric vs. asymmetric).
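For readers unfamiliar with the two noise types, the sketch below illustrates one common way to define them. It is not the procedure used to build the competition data: the function names, the `noise_rate` and `seed` parameters, and the "flip to the next class index" rule for asymmetric noise are all assumptions made for illustration. Symmetric noise replaces a label with any other class chosen uniformly at random, while asymmetric noise replaces it with one fixed, class-dependent target.

```python
import random


def add_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label with probability noise_rate to a different class,
    chosen uniformly among the remaining classes (illustrative only)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # Every class except the true one is equally likely.
            candidates = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(y)
    return noisy


def add_asymmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Flip each label with probability noise_rate to one fixed target class.
    ASSUMPTION: the target is the next class index (wrapping around);
    the mapping used for the competition data is not documented here."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            noisy.append((y + 1) % num_classes)
        else:
            noisy.append(y)
    return noisy
```

With 6 classes, `add_symmetric_noise(labels, 6, 0.2)` would corrupt roughly 20% of the labels, spreading the errors evenly over the other 5 classes. `add_asymmetric_noise(labels, 6, 0.2)` would corrupt a similar fraction but always confuse each class with the same neighbor.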
The dataset can be downloaded here.
Once downloaded, the dataset folder contains four subfolders named A, B, C, and D.
Each subfolder corresponds to one of the four datasets generated using the procedure described above.
Inside each subfolder, you will find two files:
- train.json.gz: Training set (with labels) for model training.
- test.json.gz: Test set (without labels) for generating predictions.
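The files are gzip-compressed JSON, so they can be read with the Python standard library alone. The following is a minimal sketch. The helper name `load_json_gz` is hypothetical, and it assumes each file holds a single JSON document. The structure of the decoded object (keys, field names) is not described in this section, so inspect it before relying on any particular layout.

```python
import gzip
import json


def load_json_gz(path):
    """Read a gzip-compressed JSON file and return the decoded Python object.
    ASSUMPTION: the file contains one JSON document encoded as UTF-8."""
    # "rt" opens the compressed stream in text mode so json.load can parse it.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

For example, `train = load_json_gz("A/train.json.gz")` would load the training set of dataset A, assuming the subfolder sits in the current working directory. This reads the whole file into memory at once, which may be slow for large files.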