The Magellan Data Repository

This page lists all data sets available for various data integration / data wrangling tasks. Some of these data sets were created by our group; others were collected from other websites or research groups. If you use the data in this repository, you can cite it using the following BibTeX entry:

@misc{magellandata,
  title = {The Magellan Data Repository},
  howpublished = {\url{https://sites.google.com/site/anhaidgroup/projects/data}},
  author = {Das, Sanjib and Doan, AnHai and G. C., Paul Suganthan and Gokhale, Chaitanya and Konda, Pradap and Govind, Yash and Paulsen, Derek},
  institution = {University of Wisconsin-Madison}
}

This will help others to obtain the same data sets and replicate your experiments.

The 784 Data Sets for EM

These 24 data sets were created by students in the CS 784 data science class at UW-Madison, Fall 2015, as a part of their class project. While the data was originally created for entity matching purposes, it can also be used to do experiments on other tasks, such as wrapper construction, data cleaning, visualization, etc. More details.

Some results on these data sets were reported in our VLDB-16 paper.

Links

Description of the 784 Data Sets

The Corleone Data Sets for EM

Some results on these data sets were reported in our SIGMOD-14 paper and SIGMOD-17 paper.

Links

Sample blocking rules created for the Corleone data sets

The Falcon Data Sets for EM

Some results on these data sets were reported in our SIGMOD-17 paper.

Links

Sample blocking rules created for the Falcon data sets

The DeepMatcher Data Sets for EM (Matching Step)

These data sets are available here. Some results on these data sets were reported in our SIGMOD-18 paper.

Domain Science Data Sets for EM

These data sets came about while working with domain scientists. Some of them were discussed in our SIGMOD-19 paper and EDBT-19 paper.

- Drug mapping data sets
- UMETRICS data sets
- JCI data
- Water data

The 839 Data Sets for EM

These data sets were created to evaluate CloudMatcher.

The NYC Data Sets for EM

These three data sets were used to evaluate CloudMatcher. We do not have gold-standard matches for these data sets, but we do have some clustering results.

- Hospital (HOSPITAL, ADDRESS, CITY, STATE, ZIP_CODE, PHONE)
- Fx (FIRM_PERSON_NAME, ADDRESS1, CITY, STATE, POSTAL_AREA)
- Sars* (NAME, ADDRESS, CITY, STATE, ZIP, SSN)

The GreenBay Data Sets for EM (Blocking Step)

The data sets in this section were originally used for blocking experiments. We collected the original data sets and then pre-processed them, for example by making the schemas of the two tables identical and by reformatting the data sets to follow a common format (so that we can easily run experiments on all of them). We also tried to compile statistics about each data set (such as the number of missing values in each column) to help us understand the experimental results.
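As an illustration of the kind of per-column statistic mentioned above, here is a minimal Python sketch that counts missing values in each column of a table. It is not the script used to prepare these data sets. It assumes the table is available as CSV text with a header row, and the function name `count_missing_per_column`, the `MISSING_TOKENS` set, and the toy table are all hypothetical.

```python
import csv
import io

# Assumption: these tokens mark a missing value; the real data sets may use others.
MISSING_TOKENS = {"", "nan", "null", "none", "n/a"}


def count_missing_per_column(csv_text):
    """Return {column name: number of missing cells} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            # DictReader fills cells absent from a short row with None.
            value = row.get(name)
            if value is None or value.strip().lower() in MISSING_TOKENS:
                counts[name] += 1
    return counts


# Hypothetical toy table, not taken from any data set in this repository.
sample = "id,name,price\n1,widget,9.99\n2,,\n3,gadget,N/A\n"
print(count_missing_per_column(sample))  # prints {'id': 0, 'name': 1, 'price': 2}
```

Running the same function on both tables of a data set gives a quick view of which attributes are too sparse to be useful for blocking.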

The common format that we used is here.

- LaCrosse*
- Survey*
- Citeseer-DBLP (Big Citation)
- DBLP-ACM
- DBLP-Scholar (Small Citation)
- Fodors-Zagats
- Songs

Product Data (Structured)

- Abt-Buy (name, desc, price)
- Amazon-Google (title, desc, manufacturer, price)
- Walmart-Amazon (title, brand, price, short-desc, long-desc, model, etc.)
- WDC
- Clothing*
- Electronics*
- Home*
- Tools*
- Clothing5K* (name, brand, manufacturer, 20 more attributes)
- Electronics5K*

Product Data (Textual; whole product is a big blob of text, often title plus long description)

- Clothing5K_textual*
- Electronics5K_textual*

Utility Scripts

The GreenBay Data Sets for EM (Matching Step)

Miscellaneous Data Sets for EM

Other Data Set Repositories for EM

- UCI data sets - Collection of data cleaning and entity resolution data sets.
- Benchmark data sets for entity resolution
- RIDDLE - Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty.
- EMBench - Entity matching benchmark data set generator.
- WDC - Products for Entity matching
- SIGMOD-2020 Entity Matching Contest

The Illinois Data Sets for Schema/Ontology Matching

These data sets were used in AnHai Doan's PhD thesis and in a few subsequent papers at UIUC during 2002-2006 for schema/ontology matching research. Recently they have been used again for schema matching research.

Here's the old link

The Smurf Data Sets for String Matching

These data sets were used in our VLDB-19 paper. They are available here.
