Dataset

The dataset used in this competition is derived from ogbg-ppa, the publicly available Protein-Protein Association (PPA) dataset from the Open Graph Benchmark (OGB). The original dataset has 37 classes and is widely used for graph classification tasks.

Dataset Selection and Modification

For this competition:

Subset Selection:
- 6 of the 37 available classes were chosen at random.
- 40% of the original dataset was randomly selected.

Noise Addition:
- Different levels of symmetric and asymmetric noise were added to the labels.
- Four distinct datasets were created, which differ in:
  - the percentage of noise in the labels, or
  - the type of noise (symmetric vs. asymmetric).
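
To make the two noise types concrete, here is a minimal sketch of how label noise of each kind is commonly simulated. It is not the organizers' procedure: the function names, the noise rates, and the asymmetric flip rule (each class flips to the next class index) are all assumptions chosen for illustration. Symmetric noise replaces a label with any other class with equal probability, while asymmetric noise always sends a given class to the same wrong class.

```python
import random

NUM_CLASSES = 6  # the competition subset keeps 6 classes


def symmetric_noise(labels, rate, num_classes=NUM_CLASSES, seed=0):
    """Flip each label with probability `rate` to a uniformly chosen other class."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < rate:
            # Choose among all classes except the true one.
            y = rng.choice([c for c in range(num_classes) if c != y])
        noisy.append(y)
    return noisy


def asymmetric_noise(labels, rate, num_classes=NUM_CLASSES, seed=0):
    """Flip each label with probability `rate` to one fixed class per source class.

    Assumption: class c is always confused with class (c + 1) mod num_classes.
    The real class-to-class mapping used in the competition is not documented.
    """
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < rate:
            y = (y + 1) % num_classes
        noisy.append(y)
    return noisy


# Hypothetical example: 12 clean labels, 20% noise of each type.
clean_labels = [i % NUM_CLASSES for i in range(12)]
noisy_symmetric = symmetric_noise(clean_labels, rate=0.2)
noisy_asymmetric = asymmetric_noise(clean_labels, rate=0.2)
```

Under symmetric noise the errors are spread evenly over the wrong classes, whereas asymmetric noise concentrates them in specific class pairs, which is usually the harder case for a classifier.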

Dataset Structure

The dataset can be downloaded here.

Folder Organization

Once downloaded, the dataset folder contains four subfolders named A, B, C, and D. Each subfolder corresponds to one of the four datasets generated using the procedure described above.
Inside each folder, you will find two files:
- train.json.gz: Training set (with labels) for model training.
- test.json.gz: Test set (without labels) for generating predictions.
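
As a starting point, the files could be read along these lines. This is a sketch under assumptions: it supposes each file is a single gzip-compressed JSON document, and it does not assume anything about the fields inside, since the schema is not described here. The helper names `load_json_gz` and `load_all_datasets` are hypothetical.

```python
import gzip
import json
from pathlib import Path


def load_json_gz(path):
    """Read one gzip-compressed JSON file and return the parsed object."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def load_all_datasets(root, folders=("A", "B", "C", "D")):
    """Load train and test files from each subfolder of `root`.

    Returns a dict such as {"A": {"train": ..., "test": ...}, ...}.
    """
    root = Path(root)
    datasets = {}
    for name in folders:
        datasets[name] = {
            split: load_json_gz(root / name / f"{split}.json.gz")
            for split in ("train", "test")
        }
    return datasets
```

Calling `load_all_datasets("path/to/dataset")`, with the path replaced by the actual download location, loads all eight files into memory at once. If the files are large, it may be preferable to call `load_json_gz` on one folder at a time.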
