Data Mining Datasets

Labeled Unweighted Transactional Graph Datasets

Seven bio- and chemo-informatics datasets and two social network datasets. (download) (original source)
AIDS Antiviral Screening Data. (download) (raw data)
Cancer datasets. (original source)

Labeled Weighted Transactional Graph Datasets

[weight is a function of label + synthetic weight]

Cancer Dataset: MCF-7. (normal weight distribution) (negative exponential weight distribution)
Cancer Dataset: P388. (normal weight distribution) (negative exponential weight distribution)
Cancer Dataset: Yeast. (normal weight distribution) (negative exponential weight distribution)
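
One way to read "weight is a function of label" is that every edge carrying the same label receives the same synthetic weight, drawn once from the chosen distribution. The sketch below illustrates that reading only and is not the authors' generator: the function names, the distribution parameters (mean 0.5 and standard deviation 0.15 for the normal case, rate 5.0 for the negative exponential case), and the clipping range [0.01, 1.0] are all assumptions made for the example.

```python
import random


def label_weights(labels, dist="normal", seed=0):
    """Draw one synthetic weight per distinct label (hypothetical helper).

    All distribution parameters below are illustrative assumptions,
    not values taken from the datasets.
    """
    rng = random.Random(seed)
    weights = {}
    for label in sorted(set(labels)):
        if dist == "normal":
            w = rng.gauss(0.5, 0.15)      # assumed mean and standard deviation
        elif dist == "exponential":
            w = rng.expovariate(5.0)      # assumed rate, so the mean is 0.2
        else:
            raise ValueError(f"unknown distribution: {dist}")
        # Clip into an assumed valid range so every weight is positive.
        weights[label] = min(max(w, 0.01), 1.0)
    return weights


def weight_edges(edges, weights):
    """Attach the per-label weight to each (u, v, label) edge."""
    return [(u, v, label, weights[label]) for u, v, label in edges]
```

With this reading, two edges that share a label always share a weight, which is the property that separates these datasets from the ones where the weight is not a function of the label.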

If you use these datasets, please cite us:

Bibtex:

@inproceedings{islam2018wfsm,
  title={WFSM-MaxPWS: An Efficient Approach for Mining Weighted Frequent Subgraphs from Edge-Weighted Graph Databases},
  author={Islam, Md Ashraful and Ahmed, Chowdhury Farhan and Leung, Carson K and Hoi, Calvin SH},
  booktitle={Pacific-Asia Conference on Knowledge Discovery and Data Mining},
  pages={664--676},
  year={2018},
  organization={Springer}
}

Labeled Weighted Transactional Graph Datasets

[weight is not a function of label + synthetic weight]

Compound_Graph
- Normal Distribution (download) (distribution curve)
- Positively Skewed Normal Distribution (download) (distribution curve)
- Negatively Skewed Normal Distribution (download) (distribution curve)

Weighted Call Graph

[weight is not a function of label + real weight]

About

Original Source
Each node represents a function in the call graph
Each directed edge (u,v) represents a call from u to v

Dataset Conversion Method

Node label is set to the most frequent opcode in the function, chosen from: ['mov', 'call', 'lea', 'jmp', 'push', 'add', 'xor', 'cmp', 'int3', 'nop', 'pushl', 'dec', 'sub', 'insl', 'inc', 'jz', 'jnz', 'je', 'jne', 'ja', 'jna', 'js', 'jns', 'jl', 'jnl', 'jg', 'jng']
Edge weight is calculated by taking the average of the two endpoint nodes' total opcode calls
Edges are unlabeled
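
The conversion rules above can be sketched in Python as follows. This is a minimal illustration, not the authors' conversion script: the names `node_label`, `edge_weight`, and `convert` are hypothetical. It assumes that a function's "total opcode calls" is the number of instructions in that function, and that a function containing none of the listed opcodes gets no label.

```python
from collections import Counter

# Opcode vocabulary listed in the conversion method.
OPCODES = {
    'mov', 'call', 'lea', 'jmp', 'push', 'add', 'xor', 'cmp', 'int3',
    'nop', 'pushl', 'dec', 'sub', 'insl', 'inc', 'jz', 'jnz', 'je',
    'jne', 'ja', 'jna', 'js', 'jns', 'jl', 'jnl', 'jg', 'jng',
}


def node_label(opcodes):
    """Most frequent listed opcode in one function, or None if none occur."""
    counts = Counter(op for op in opcodes if op in OPCODES)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def edge_weight(opcodes_u, opcodes_v):
    """Average of the two endpoint functions' instruction counts (assumed meaning)."""
    return (len(opcodes_u) + len(opcodes_v)) / 2


def convert(functions, calls):
    """functions: {name: [opcode, ...]}; calls: iterable of (caller, callee).

    Returns node labels and edge weights; edges themselves stay unlabeled.
    """
    labels = {name: node_label(ops) for name, ops in functions.items()}
    weights = {(u, v): edge_weight(functions[u], functions[v]) for u, v in calls}
    return labels, weights
```

For example, a function with opcodes push, mov, mov, call would be labeled 'mov', and an edge between a 4-instruction function and a 6-instruction function would get weight 5.0 under these assumptions.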

Goodware (download)

Number of graphs : 546
Mean number of nodes : 648.1
Mean degree : 3.3
Median degree : 2.7
Maximum degree : 10.1
Number of isolated nodes : 130812
Mean number of isolated nodes per graph : 239.6
Number of self loops : 0

Malware (download)

Number of graphs : 815
Mean number of nodes : 871.5
Mean degree : 3.6
Median degree : 3.7
Maximum degree : 34.4
Number of isolated nodes : 231990
Mean number of isolated nodes per graph : 284.7
Number of self loops : 0
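
The per-dataset counts reported for Goodware and Malware can be recomputed from the graphs with a short script. The sketch below is an assumption-laden illustration: the `summarize` function and its input layout (a list of `(nodes, edges)` pairs) are hypothetical, and it treats a node as isolated when no edge touches it. Degree statistics are left out because the page does not say how they are aggregated across graphs.

```python
def summarize(graphs):
    """Summarize a list of (nodes, edges) graphs (hypothetical layout).

    nodes: iterable of node ids; edges: iterable of directed (u, v) pairs.
    """
    total_nodes = 0
    isolated = 0
    self_loops = 0
    for nodes, edges in graphs:
        node_set = set(nodes)
        touched = set()
        for u, v in edges:
            if u == v:
                self_loops += 1
            touched.update((u, v))
        total_nodes += len(node_set)
        # Assumed definition: a node is isolated if no edge touches it.
        isolated += len(node_set - touched)
    n = len(graphs)
    return {
        "graphs": n,
        "mean_nodes": total_nodes / n,
        "isolated_nodes": isolated,
        "mean_isolated_nodes": isolated / n,
        "self_loops": self_loops,
    }
```

The reported means are consistent with the totals: 130812 isolated nodes over 546 Goodware graphs gives about 239.6 per graph, and 231990 over 815 Malware graphs gives about 284.7.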

Weighted Sequential Datasets

1. SIGN (weight is a function of item)

2. LEVIATHAN (weight is a function of item)

3. FIFA (weight is a function of item)

4. Synthetic Dataset (from the SPMF website, with positively and negatively skewed weight distributions)
