Research‎ > ‎Resources‎ > ‎

Graph Dataset

From http://www.eecs.wsu.edu/mgd/gdb.html

Below we provide information and pointers to datasets that are either already represented as a graph, or are relational in nature and lend themselves to a graph representation. Ultimately, we plan to evolve this resource into a collection of graph datasets that can be used as a testbed or benchmark for graph-based algorithms.

Before we list the datasets, we note that issues of file format and semantics will need to be addressed in order to make this collection useful for comparisons.

  • File Format. As yet, there is no agreed-upon standard for representing graphs, and each researcher seems to have their own home-grown file format. One possibility is to use the emerging XML standard called GraphML.
  • Semantics. Even though two researchers may use the same source data to construct their graphs, the resulting graph representations may vary (e.g., in chemical domains, one represents a bond as an undirected edge, while another uses a labeled vertex with two directed edges). We can, of course, leave these issues to the individual researcher or define some standards for graph representation (perhaps conceptual graphs would offer a well-defined semantics for these graphs).

Mutagenesis

The Mutagenesis data describes several chemical compounds and classifies them as mutagenic or not mutagenic. A Prolog format version of the data is available, which consists of 188 examples in a regression-friendly format (mutagenesis-f.pl) and a regression-unfriendly format (mutagenesis-u.pl).

Another version of the mutagenesis data is available as part of the 2000 PAKDD Challenge. This data is in the Structure Data File (SDFile) format, which is popular in chemistry.

Toxicology

The Predictive Toxicology Challenge (PTC) data consists of information about chemical compounds known to cause or not cause cancer in rats and mice. The data is available at the above link.

Proteins

The authors have developed a program that converts the atomic structure within Protein Data Bank (PDB) files into SUBDUE graph format. The source code and sample data are available here. More protein data is available from the Protein Data Bank (PDB).

Citations

Citation graphs have become a popular domain for graph-based data mining. We provide pointers to two sets of citation data that can be converted to graph form.

  • The 2003 KDD Cup was based on citation information from the field of high-energy physics between the years 1992 and 2003. This data is available here.
  • Andrew McCallum's Cora bibliographic citation data is available here.

Collaborative Filtering

The MovieLens database represents a recommender system version of collaborative filtering. The data contains information about users, movies, and users' ratings of the movies. The data is available here.

Internet Movie Database (IMDb)

The Internet Movie Database (IMDb) contains information about movies, actors, directors, etc. See their page on interfaces for more information on obtaining the data.

The authors have developed a program that extracts movie data from the IMDb and converts it to a SUBDUE-formatted graph. See the readme.txt file and the source code imdb2graph.c.

Web Graphs

Still looking for data on web structure. The authors have developed some web crawlers that can convert portions of the web to a graph, which can be made available upon request.

Synthetic Data

Synthetically-generated graphs offer the ability to control various properties (e.g., size, average degree, etc.). However, generating graphs with specific properties (e.g., power law) is not always easy. We need to answer two questions:

  • What properties of graphs would we like to control for synthetic generation?
  • How can we generate graphs that exhibit one or more of such properties (e.g., graph grammars)?

Here are some existing synthetic graph datasets.

______________________________________________________

Dataset

I have downloaded parts of them to the graphics server, listed as follows:

  • Power-law graphs
    Group Name Type Vertices Edges Description
    social network soc undirected-graph 26 64 A social network derived from a dataset used in the GD 2003 graph drawing competition
    protein interaction bo undirected-graph 1458 1948 protein interaction network
    internet peer.all.020515 undirected-graph 13,579 37,448 Autonomous system peering relationships in a subset of the internet at May 15, 2002
    citations hep-th-citations undirected-graph 27,400 352,021 citation network from the high energy physics literature
    usroads usroads undirected-graph 129,164 330,870 The US road graph for the United States.
    router itdk0304 undirected-graph 192,244 1,218,132 A router graph of the internet from April 2003.
    wikipedia/20051105 20051105 pages current-filtered directed-graph 1,634,989 19,753,078 The filtered wikipedia graph from 5 November 2005.
  • General graphs
    Group Name Type Vertices Edges
    walshaw data undirected-graph 2,851 30,186
    walshaw 3elt undirected-graph 4,720 27,444
    walshaw 4elt undirected-graph 15,606 91,756
    walshaw add20 undirected-graph 2,395 14,924
    walshaw uk undirected-graph 4,824 13,674
    walshaw add32 undirected-graph 4,960 18,924
    walshaw bcsstk33 undirected-graph 8,738 583,166
    walshaw whitaker3 undirected-graph 9,800 57,978
    walshaw crack undirected-graph 10,240 60,760
    walshaw wing_nodal undirected-graph 10,937 150,976
    walshaw fe_4elt2 undirected-graph 11,143 65,636
    walshaw vibrobox undirected-graph 12,328 330,500
    walshaw bcsstk29 undirected-graph 13,992 605,496
    walshaw fe_sphere undirected-graph 16,386 98,304
    walshaw cti undirected-graph 16,840 96,464
    walshaw memplus undirected-graph 17,758 108,392
    walshaw cs4 undirected-graph 22,499 87,716
    walshaw bcsstk30 undirected-graph 28,924 2,014,568
    walshaw bcsstk31 undirected-graph 35,588 1,145,828
    walshaw fe_pwt undirected-graph 36,519 289,588
    walshaw bcsstk32 undirected-graph 44,609 1,970,092
    walshaw fe_body undirected-graph 45,087 327,468
    walshaw t60k undirected-graph 60,005 178,880
    walshaw wing undirected-graph 62,032 243,088
    walshaw brack2 undirected-graph 62,631 733,118
    walshaw finan512 undirected-graph 74,752 522,240
    walshaw fe_tooth undirected-graph 78,136 905,182
    walshaw fe_rotor undirected-graph 99,617 1,324,862
    walshaw 598a undirected-graph 110,971 1,483,868
    walshaw fe_ocean undirected-graph 143,437 819,186
    walshaw 144 undirected-graph 144,649 2,148,786
    walshaw wave undirected-graph 156,317 2,118,662
    walshaw m14b undirected-graph 214,765 3,358,036
    walshaw auto undirected-graph 448,695 6,629,222

Data in use

1. bo graph

This network shows yeast protein interactions. Proteins can have direct or indirect interactions with one another. Indirect interaction refers to being a member of the same functional module (e.g., transcription initiation complex, ribosome) but without directly binding to one another. In contrast, direct interaction refers to two amino acid chains that binds to each other. Obviously, many of these interactions reflect the dynamic state of the cell and are present or absent depending on the particular environment or developmental status of the cell. However, the sum of existing and potential interactions altogether defines the protein network and is ultimately encoded within the genome of a given organism.

The network consists of one big cluster and several isolated clusters. The largest cluster, which is the graph used here, contains 1458 out of 1870 proteins (~78%).

Interesting notes from the nature paper "Lethality and centrality in protein networks":
"Due to its size, a complete map of the network (Fig. 1a), while informative, in itself offers little insight into its large-scale characteristics."
"This indicates that the network of protein interactions in two separate organisms forms a highly inhomogeneous scale-free network in which a few highly connected proteins play a central role in mediating interactions among numerous, less connected proteins."

Reference:

2. hep-th graph

It is a weighted network of coauthorships between scientists posting preprints on the High-Energy Theory E-Print Archive between Jan 1, 1995 and December 31, 1999.

Reference:

3. as-rel.20071008 graph

It is the autonomous systems graph on Oct. 8th 2007, and it is obtained from CAIDA. The original graph is directed, we converted it to undirected graph by ignoring the direction.

Comments