Data Repository

This page lists the data sets available for various data integration and data wrangling tasks. Some of these data sets were created by our group; others were collected from other websites or research groups. If you use data from this repository, please cite it with the following BibTeX entry:

           @misc{magellandata,
           title = {The Magellan Data Repository},
           howpublished = {\url{https://sites.google.com/site/anhaidgroup/projects/data}},
           author = {Das, Sanjib and Doan, AnHai and G. C., Paul Suganthan and Gokhale, Chaitanya and Konda, Pradap},
           institution = {University of Wisconsin-Madison}}

This will help others to obtain the same data sets and replicate your experiments.
        
The 784 Data Sets

These 24 data sets were created by students in the CS 784 data science class at UW-Madison in Fall 2015 as part of their class project. Although the data was originally created for entity matching, it can also be used for experiments on other tasks, such as wrapper construction, data cleaning, and visualization.

ID | Name | Domain | Source A | Source B | HTML Files (A, B), Input Tables (A, B), Candidate Set (C), Labeled Data (L), .tar.gz
1 | Restaurants1 | Restaurants | Zomato | Yelp | 3013513530135882781044502.6M
2 | Bikes | Bikes | Bikedekho | Bikewale | 134889963478590028009450426K
3 | Movies1 | Movies | Rotten Tomatoes | IMDB | 9497743773906407780796006.9M
4 | Movies2 | Movies | IMDB | TMD | 1003189671003110017114881740018M
5 | Movies3 | Movies | IMDB | Rotten Tomatoes | 3091312529603093637983993.0M
6 | Movies4 | Movies | Amazon | Rotten Tomatoes | 30263429524163915402841210M
7 | Restaurants2 | Restaurants | Zomato | Yelp | 769140576960389710630444628K
8 | Electronics | Electronics | Amazon | Best Buy | 426050014259500182383339520M
9 | Music | Music | iTunes | Amazon Music | 48755619690655923586925382.3M
10 | Restaurants3 | Restaurants | Yelp | Yellow Pages | 9958287989947287874313074007.1M
11 | Cosmetics | Cosmetics | Amazon | Sephora | 2115253564431102636034408966K
12 | Ebooks1 | Ebooks | iTunes | eBooks | 631111094170122802518383108910M
13 | Ebooks2 | Ebooks | iTunes | eBooks | 6761336116974280241365240010M
14 | Beer | Beer | Beer Advocate | Rate Beer | 100327443453000433496145088M
15 | Books1 | Books | Amazon | Barnes & Noble | 35073509350635082017374449K
16 | Books2 | Books | Goodreads | Barnes & Noble | 398840373967370040293961.7M
17 | Anime | Anime | My Anime List | Anime Planet | 3192211 40014000 1383443933.0M
18 | Books3 | Books | Barnes & Noble | Half | 30223099302230991287450381K
19 | Movies5 | Movies | Roger Ebert | IMDB | 3450682535566913504373581K
20 | Books4 | Books | Amazon | Barnes & Noble | 867599599836995841984501.3M
21 | Restaurants4 | Restaurants | Yellow Pages | Yelp | 3866131184052235278400487K
22 | Books5 | Books | Amazon | Barnes & Noble | 78224996299929992569783988.0M
23 | Citations | Citations | Google Scholar | DBLP | 3673122124185882623141785M
24 | Baby Products | Baby Products | Babies 'R' Us | Buy Buy Baby | 50991100750851071811855400645K


The Corleone Data Sets

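ID | Name | Domain | Source A | Source B | Input Table A | Input Table B | Matches (M) | .tar.gz | Acknowledgements
1 | Restaurants | Restaurants | Fodors | Zagats | 533 | 331 | 112 | 25K | [1]
2 | Products | Products | Walmart | Amazon | 2554 | 22074 | 1154 | 13M |
3 | Citations | Citations | DBLP | Google Scholar | 2616 | 64263 | 5347 | 3.9M | [2]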


The Falcon Data Sets

ID | Name | Domain | Source A | Source B | Input Table A | Input Table B | Matches (M) | .tar.gz | Comments
1 | Songs | Music | Million Songs Data | Million Songs Data | 1000000 | 1000000 | 1292023 | 58M | This is a case of self join.
2 | Citations | Citations | Citeseer | DBLP | 1823978 | 2512927 | 558787 | 227M | Golden matches were generated using heuristic rules.

Citeseer data is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

We have cleaned up the data, for example by removing rows with many missing attributes and by keeping only a subset of the attributes.

If you share or redistribute the data, please adhere to the license terms.
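The kind of clean-up described above (dropping rows with many missing attributes and keeping only a subset of attributes) can be sketched as follows. This is a minimal illustration, not the procedure actually used to prepare the published data: the attribute names, the `max_missing` threshold, and the choice to count missing values after projecting onto the kept attributes are all assumptions.

```python
def clean_rows(rows, keep, max_missing=1):
    """Project each row onto `keep` and drop rows with too many empty values.

    `keep` and `max_missing` are illustrative choices, not the settings
    used to prepare the published data.
    """
    cleaned = []
    for row in rows:
        # Keep only the selected attributes; treat None as an empty string.
        projected = {name: (row.get(name) or "").strip() for name in keep}
        missing = sum(1 for value in projected.values() if not value)
        if missing <= max_missing:
            cleaned.append(projected)
    return cleaned


# Made-up citation records; the attribute names are hypothetical.
rows = [
    {"title": "A Paper", "authors": "X, Y", "year": "2001", "pages": "1-10"},
    {"title": "", "authors": "", "year": "1999", "pages": ""},
    {"title": "Another Paper", "authors": "", "year": "2005", "pages": None},
]
cleaned = clean_rows(rows, keep=["title", "authors", "year"], max_missing=1)
print(cleaned)
```

Here the second record is dropped because two of its three kept attributes are empty, while the third survives with a single missing value.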


Miscellaneous Data Sets

ID | Name | Domain | Source A | Source B | Input Tables (A, B) | Matches (M) | .tar.gz | Comments
1 | Movies | Movies | IMDB | OMDB | 11322622302426 | - | 403M | The OMDB table has an attribute "imdbID" which may be used to generate the golden matches.

OMDB data is licensed under CC BY-NC 4.0 License. If you share or redistribute the data, please adhere to the license terms.
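The comment on the Movies data set suggests deriving golden matches from the OMDB attribute "imdbID". One way this could look is an equality join between that attribute and the identifier column of the IMDB table. The sketch below is only an illustration under assumptions: the IMDB column name `id`, the sample rows, and the function name `golden_matches` are hypothetical, and only "imdbID" comes from this page.

```python
import csv
import io


def golden_matches(imdb_rows, omdb_rows, imdb_key="id", omdb_key="imdbID"):
    """Pair rows whose IMDB identifier equals the OMDB "imdbID" value.

    `imdb_key` is a hypothetical column name; only "imdbID" is named
    in the data set description.
    """
    # Index the IMDB rows by identifier for constant-time lookups.
    by_id = {}
    for row in imdb_rows:
        by_id.setdefault(row[imdb_key].strip(), []).append(row)

    matches = []
    for row in omdb_rows:
        key = (row.get(omdb_key) or "").strip()
        if not key:
            continue  # no identifier, so no match can be derived
        for imdb_row in by_id.get(key, []):
            matches.append((imdb_row, row))
    return matches


# Tiny made-up tables standing in for the real CSV files.
imdb_csv = "id,title\ntt0000001,Film One\ntt0000002,Film Two\n"
omdb_csv = "imdbID,Title\ntt0000002,Film Two\ntt0000009,Other Film\n"

imdb_rows = list(csv.DictReader(io.StringIO(imdb_csv)))
omdb_rows = list(csv.DictReader(io.StringIO(omdb_csv)))
pairs = golden_matches(imdb_rows, omdb_rows)
print([(a["id"], b["imdbID"]) for a, b in pairs])
```

Indexing one table by its identifier keeps the join linear in the number of rows, which matters for tables of this size.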


Other Data Set Repositories

UCI data sets - Collection of data cleaning and entity resolution data sets.
RIDDLE - Repository of Information on Duplicate Detection, Record Linkage, and Identity Uncertainty.
EMBench - Entity matching benchmark data set generator.