Below we provide information and pointers to datasets that are either
already represented as a graph, or are relational in nature and lend
themselves to a graph representation. Ultimately, we plan to evolve this
resource into a collection of graph datasets that can be used as a testbed
or benchmark for graph-based algorithms.
Before we list the datasets, we note that issues of file format and
semantics will need to be addressed in order to make this collection useful
- File Format. As yet, there is no agreed-upon standard for
representing graphs, and each researcher seems to have their own home-grown
file format. One possibility is to use the emerging XML standard called GraphML.
- Semantics. Even though two researchers may use the same source
data to construct their graphs, the resulting graph representations may
vary (e.g., in chemical domains, one represents a bond as an undirected
edge, while another uses a labeled vertex with two directed edges). We can,
of course, leave these issues to the individual researcher or define some
standards for graph representation (perhaps conceptual graphs would offer
a well-defined semantics for these graphs).
The Mutagenesis data describes several chemical compounds and classifies
them as mutagenic or not mutagenic. A Prolog format version of the data is
available, which consists of 188 examples in a regression-friendly
format (mutagenesis-f.pl) and a
regression-unfriendly format (mutagenesis-u.pl).
Another version of the mutagenesis data is available as part of the 2000
PAKDD Challenge. This data is in the Structure
Data File (SDFile) format, which is popular in chemistry.
Toxicology Challenge (PTC) data consists of information about chemical
compounds known to cause or not cause cancer in rats and mice. The data is
available at the above link.
The authors have developed a program that converts the
atomic structure within Protein Data Bank (PDB) files into SUBDUE graph format. The source code
and sample data are available here. More
protein data is available from the Protein Data Bank (PDB).
Citation graphs have become a popular domain for graph-based data
mining. We provide pointers to two sets of citation data that can be
converted to graph form.
- The 2003 KDD Cup was based on citation information from the field of
high-energy physics between the years 1992 and 2003. This data is available
- Andrew McCallum's Cora bibliographic citation data is available here.
The MovieLens database represents a recommender system version of
collaborative filtering. The data contains information about users,
movies, and users' ratings of the movies. The data is available here.
Internet Movie Database (IMDb)
The Internet Movie Database (IMDb)
contains information about movies, actors, directors, etc. See their
page on interfaces
for more information on obtaining the data.
The authors have developed a program that extracts movie data from the IMDb
and converts it to a SUBDUE-formatted
graph. See the readme.txt file and the source
Still looking for data on web structure. The authors have developed some
web crawlers that can convert portions of the web to a graph, which can be
made available upon request.
Synthetically-generated graphs offer the ability to control various
properties (e.g., size, average degree, etc.). However, generating graphs
with specific properties (e.g., power law) is not always easy. We need to
answer two questions:
- What properties of graphs would we like to control for synthetic
- How can we generate graphs that exhibit one or more of such properties
(e.g., graph grammars)?
Here are some existing synthetic graph datasets.
I have downloaded parts of them to the graphics server,
listed as follows:
- Power-law graphs
||A social network derived from a dataset used in the GD
2003 graph drawing competition |
||protein interaction network |
||Autonomous system peering relationships in a subset of
the internet at May 15, 2002 |
||citation network from the high energy physics literature
||The US road graph for the United States. |
||A router graph of the internet from April 2003.
||20051105 pages current-filtered
||The filtered wikipedia graph from 5 November 2005.
- General graphs
Data in use
1. bo graph
This network shows yeast protein interactions. Proteins can have direct or
indirect interactions with one another. Indirect interaction refers to being a
member of the same functional module (e.g., transcription initiation complex,
ribosome) but without directly binding to one another. In contrast, direct
interaction refers to two amino acid chains that binds to each other. Obviously,
many of these interactions reflect the dynamic state of the cell and are present
or absent depending on the particular environment or developmental status of the
cell. However, the sum of existing and potential interactions altogether defines
the protein network and is ultimately encoded within the genome of a given
The network consists of one big cluster and several isolated clusters. The
largest cluster, which is the graph used here, contains 1458 out of 1870
Interesting notes from the nature paper "Lethality and centrality in protein
"Due to its size, a complete map of the network (Fig. 1a),
while informative, in itself offers little insight into its large-scale
"This indicates that the network of protein
interactions in two separate organisms forms a highly inhomogeneous scale-free
network in which a few highly connected proteins play a central role in
mediating interactions among numerous, less connected proteins."
2. hep-th graph
It is a weighted network of coauthorships between scientists posting
preprints on the High-Energy Theory E-Print Archive between Jan 1, 1995 and
December 31, 1999.
3. as-rel.20071008 graph
It is the autonomous systems graph on Oct. 8th 2007, and it is obtained from
CAIDA. The original graph is directed, we
converted it to undirected graph by ignoring the direction.