Description of the 784 Data Sets

How Were the Data Sets Generated?

The data science class CS 784 Fall 2015 had 24 project teams. Each team created one data set, resulting in the 24 data sets listed below, one per table row. To create a data set, a team

  • selected two Websites and crawled HTML pages from those sites,

  • extracted tuples from the HTML pages to create two tables, one per site,

  • performed blocking on these tables (to remove obviously non-matched tuple pairs), producing a set of candidate tuple pairs,

  • took a random sample of pairs from the above set and labeled the pairs in the sample as "matched" / "no-matched".

Finally, the team used the labeled sample to develop a matcher. For more details, see Section 6.1 of the Magellan paper.
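
To make the blocking and sampling steps above concrete, here is a minimal sketch in Python. It is an illustration only, not the procedure any team actually used: the toy tables, the "name" attribute, the blocking rule (keep a pair only if both names share the same first word), and the sample size are all assumptions.

```python
import random

# Two toy tables standing in for the tables extracted from the two Websites.
# The "name" attribute and all values are made up for illustration.
table_a = [
    {"_id": 0, "name": "apple iphone 6"},
    {"_id": 1, "name": "samsung galaxy s6"},
    {"_id": 2, "name": "apple ipad air"},
]
table_b = [
    {"_id": 0, "name": "apple iphone 6 16gb"},
    {"_id": 1, "name": "sony xperia z3"},
    {"_id": 2, "name": "samsung galaxy s6 edge"},
]


def first_token(record):
    """Return the first word of the record's name."""
    return record["name"].split()[0]


def block(left, right):
    """Hypothetical blocking rule: keep a pair only if both names
    start with the same word; all other pairs are dropped."""
    return [
        (x["_id"], y["_id"])
        for x in left
        for y in right
        if first_token(x) == first_token(y)
    ]


# Candidate set: the tuple pairs that survive blocking.
candidates = block(table_a, table_b)

# Random sample of candidate pairs to be labeled by hand.
# A fixed seed makes the sample reproducible.
rng = random.Random(0)
sample = rng.sample(candidates, k=2)
```

Here blocking reduces the 9 possible pairs to 3 candidates, and the sample of 2 pairs is what a team would then label by hand.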

What Do the Columns of the Table Mean?

    • Sources: the Websites from which the data was crawled.

    • HTML Files: the number of HTML pages obtained from each Website (available in tar.gz format). This corresponds to the output of Step 1 described above.

    • Input Tables: the number of tuples in each table (in csv format). Note that the team did not extract tuples from *all* HTML pages (we only required each table to have at least 2,500 tuples). This corresponds to Step 2.

    • Candidate Set: A table C that stores all tuple pairs that survive blocking (in csv format). This is the output of Step 3.

    • Labeled Data: A table L which is table C with an extra column storing the label "matched" / "no-matched" (in csv format). This is the output of Step 4.

    • The last column of the table: the entire data set (except HTML pages) in tar.gz format.

A Description of Candidate Set Tables C

Let the two original tables (to be matched) be A and B. Recall that when doing blocking on these two tables, we obtain a table C of candidate tuple pairs (also called a "candidate set"). Conceptually, each tuple pair (x,y) consists of a tuple x from Table A and a tuple y from Table B.

The first five rows of Table C describe metadata (see more below). They can be ignored. The 6th row describes the attributes (of the subsequent rows). From the 7th row until the end of Table C, each row describes a pair (x,y) as discussed above. Specifically,

  • Each row starts with three attributes: _id, ltable._id, and rtable._id. These attributes give the ID of the tuple pair, the ID of tuple x in Table A, and the ID of tuple y in Table B, respectively. For example, if x is the tuple with ID 36 in Table A and y is the tuple with ID 123 in Table B, then the row describing the pair (x,y) may start with "18, 36, 123", where "18" is an ID that uniquely identifies this tuple pair in Table C. (Note that here "ltable" refers to Table A and "rtable" refers to Table B.)

  • In principle, each row in Table C does not have to contain any other attribute, because knowing ltable._id and rtable._id would allow us to retrieve the component tuples x and y from Tables A and B, respectively.

  • That said, each team may decide to add more attributes to each row, to make it more "human readable". The added attributes are of the form ltable.FOO or rtable.BAR, which refer to attribute FOO of Table A or attribute BAR of Table B, respectively. How many of these "extra" attributes to add was up to each team.
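
The row layout described above can be read with Python's standard csv module. The sketch below is illustrative only: the file content, the extra "name" attributes, and the in-memory stand-ins for Tables A and B are made up, and we assume the csv file has no spaces after the commas.

```python
import csv

# A made-up Table C in the layout described above: five metadata lines,
# one header line (the 6th), then one line per tuple pair.
candidate_csv = """#key=_id
#ltable=POINTER
#rtable=POINTER
#foreign_key_ltable=ltable._id
#foreign_key_rtable=rtable._id
_id,ltable._id,rtable._id,ltable.name,rtable.name
18,36,123,Blue Note Cafe,Blue Note Coffee
19,36,124,Blue Note Cafe,Red Door Diner
"""


def read_candidate_set(text):
    """Skip the five metadata lines and parse the rest as csv.
    Returns one dict per tuple pair, keyed by the attribute names
    found on the 6th line."""
    lines = text.splitlines()
    return list(csv.DictReader(lines[5:]))


pairs = read_candidate_set(candidate_csv)

# Hypothetical stand-ins for Tables A and B, keyed by tuple ID.
# csv values are strings, so the keys are strings as well.
table_a = {"36": {"name": "Blue Note Cafe"}}
table_b = {
    "123": {"name": "Blue Note Coffee"},
    "124": {"name": "Red Door Diner"},
}

# Retrieve the component tuples x and y of the first pair.
first = pairs[0]
x = table_a[first["ltable._id"]]
y = table_b[first["rtable._id"]]
```

Because ltable._id and rtable._id are enough to recover x and y, the extra ltable.* and rtable.* attributes are a convenience for readers and can be ignored by a program.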

The only thing left to do is to describe the first five lines of Table C (though as we said earlier, you can safely ignore these lines). These lines contain metadata for an earlier version of the Magellan system.

  • #key=_id: attribute _id is a key of Table C.

  • #ltable=POINTER and #rtable=POINTER: these two lines can be ignored.

  • #foreign_key_ltable=ltable._id: attribute ltable._id in Table C is a foreign key referencing Table A.

  • #foreign_key_rtable=rtable._id: attribute rtable._id in Table C is a foreign key referencing Table B.
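
If you do want to inspect the metadata rather than skip it, the lines above are simple "#name=value" pairs. A minimal sketch of parsing them follows; the function name parse_metadata is ours and is not part of Magellan.

```python
def parse_metadata(lines):
    """Collect the leading '#name=value' lines into a dict.
    Parsing stops at the first line that does not start with '#'."""
    metadata = {}
    for line in lines:
        if not line.startswith("#"):
            break
        name, _, value = line[1:].strip().partition("=")
        metadata[name] = value
    return metadata


# The five metadata lines described above, followed by the attribute line.
header_lines = [
    "#key=_id",
    "#ltable=POINTER",
    "#rtable=POINTER",
    "#foreign_key_ltable=ltable._id",
    "#foreign_key_rtable=rtable._id",
    "_id,ltable._id,rtable._id",
]

metadata = parse_metadata(header_lines)
```

After this runs, metadata["key"] names the key attribute of Table C, and the two foreign_key entries name the attributes that reference Tables A and B.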

A Description of Labeled Data Tables L

Recall that once a team has obtained Table C, the team manually labels a random sample of the tuple pairs in C as "matched" / "no-matched". The result is Table L, which looks similar to Table C, except that it has an extra attribute at the end called "gold". For each tuple pair in L, this attribute is 1 if the pair is considered a match, and 0 otherwise.
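
As a quick illustration, the "gold" attribute can be used to count matches and non-matches in a Table L. The sketch below uses made-up rows and assumes that Table L starts with the same five metadata lines as Table C; check the actual files before relying on this.

```python
import csv
from collections import Counter

# A made-up Table L: same layout as Table C plus a final "gold" column.
labeled_csv = """#key=_id
#ltable=POINTER
#rtable=POINTER
#foreign_key_ltable=ltable._id
#foreign_key_rtable=rtable._id
_id,ltable._id,rtable._id,gold
18,36,123,1
19,36,124,0
20,37,125,0
"""

# Skip the five metadata lines (an assumption, see above) and parse the rest.
rows = list(csv.DictReader(labeled_csv.splitlines()[5:]))

# Count how many pairs carry each label (1 = match, 0 = non-match).
counts = Counter(int(row["gold"]) for row in rows)

# Collect the (Table A ID, Table B ID) pairs labeled as matches.
matches = [
    (row["ltable._id"], row["rtable._id"])
    for row in rows
    if row["gold"] == "1"
]
```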