Due 11:59 pm Sun Apr 2
In this project stage your team will perform entity matching (EM). Recall that in Project Stage 1 you selected two structured data sets. In this stage you will find entities across these two data sets that match. Specifically, you will do the following.
First, extract data from these two sources to form two tables A and B. The two tables should have the same schema, and each tuple in each table must describe a single entity (all of the same type). For example, if the entity type is person, then each tuple describes a person, and a possible table schema can be A(name, city, state, zip, phone) (and the same schema for Table B).
Each table must have at least 3000 tuples.
Note: If you don't want to use the datasets collected in Stage 1, you can use a movie-track-song dataset that we provide below.
Next, you will use Magellan, an EM tool (developed here at UW) to match the two tables. Specifically, you should do the following:
Download and install Magellan on your machine. See Magellan's homepage (the Python package for Magellan is called py_entitymatching); it provides detailed instructions on how to install Magellan.
Read the how-to guide (also available on that homepage).
Now read through the user manual of Magellan to get a feel for how to use it. The accompanying set of Jupyter notebooks should give you a good idea of how to use it.
Now use Magellan to match the two tables A and B. Pay particular attention to Section 3.8 (selecting and debugging a matcher: a general procedure) in the how-to guide, because that's what you will have to follow. Your goal should be to create a matcher that achieves precision of at least 90% and recall as high as possible.
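To give a feel for the blocking step (the first part of the procedure in Sec 3.8), here is a minimal word-overlap blocker in plain Python. The tables, attribute names, and overlap threshold below are invented for illustration; in practice you should use the blockers that py_entitymatching provides (e.g., its overlap blocker) rather than rolling your own.

```python
# Sketch of word-overlap blocking (illustration only; use Magellan's
# blockers in practice). Tables A and B below are invented toy data.
A = [
    {"id": "a1", "name": "Dave Smith", "city": "Madison"},
    {"id": "a2", "name": "Joe Wilson", "city": "San Jose"},
]
B = [
    {"id": "b1", "name": "David Smith", "city": "Madison"},
    {"id": "b2", "name": "Daniel Adams", "city": "Austin"},
]

def word_overlap_block(A, B, attr, min_overlap=1):
    """Keep (a, b) pairs whose attr values share >= min_overlap words."""
    C = []
    for a in A:
        a_words = set(a[attr].lower().split())
        for b in B:
            b_words = set(b[attr].lower().split())
            if len(a_words & b_words) >= min_overlap:
                C.append((a["id"], b["id"]))
    return C

C = word_overlap_block(A, B, "name")
print(C)  # -> [('a1', 'b1')]: only this pair shares a name word
```

Note how blocking avoids comparing all |A| x |B| pairs during matching: only pairs that survive the (cheap) overlap test go into the candidate set C.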
What to Submit?
Recall that at this point you have two tables A and B, each with at least 3000 tuples. You have done blocking on these tables to obtain a set of candidate tuple pairs C. Then you have taken a sample S from C and labeled it. Finally, you have used the labeled set to develop a matcher that achieves precision of at least 90% and recall as high as possible.
Submit the following on your team's project page, by the deadline:
Provide a link to a DATA directory that stores both tables A and B. They should be stored in two files, each file storing a table in CSV format. We should be able to browse these files to examine the tables. There should be a README file in the same directory that explains the tables (e.g., the meaning of the attributes) and lists the number of tuples per table.
NOTE: If you use the Silicon Valley dataset provided by us (the movie-track-song dataset), and use the tables "as is", then you don't have to describe them, but do say that you use the tables "as is". If you use this dataset but downsample to work with smaller tables, then provide the description of those tables, as discussed above.
In the same directory:
provide a file that lists all tuple pairs that survive the blocking step.
provide a file that lists all tuple pairs in the sample you have taken, together with the labels, one label per each tuple pair. This file is referred to as file G in Sec 3.8 of the how-to guide.
provide two files that describe the sets I and J, respectively (see Sec 3.8 of the how-to guide for an explanation of these files).
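The sets I and J are typically obtained by randomly splitting the labeled sample G into a development set (I) and a held-out test set (J). A minimal sketch in plain Python follows; the toy pairs and the 50/50 split ratio are assumptions for illustration, and py_entitymatching has its own helper for splitting labeled Magellan tables.

```python
import random

# Toy labeled sample G: (A-id, B-id, label), 1 = match, 0 = non-match.
# Real pairs would come from your blocking output; these are invented.
G = [
    ("a1", "b1", 1), ("a2", "b5", 0), ("a3", "b2", 1),
    ("a4", "b9", 0), ("a5", "b3", 1), ("a6", "b7", 0),
]

def split_labeled_set(G, train_fraction=0.5, seed=0):
    """Shuffle G and split it into development set I and held-out set J."""
    pairs = list(G)
    random.Random(seed).shuffle(pairs)  # fixed seed for reproducibility
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]

I, J = split_labeled_set(G)
```

Keeping J untouched until the very end is what makes the final precision/recall numbers on J an honest estimate of real-world performance.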
Provide a link to a CODE directory that stores all of your code (this directory must also be browsable). In this directory, also provide a Jupyter notebook that describes your blocking and matching process (see below for details).
Provide a link to a pdf document that contains at least the following:
Describe the type of entity you want to match, briefly describe the two tables (e.g., where you obtained them), and list the number of tuples per table.
Describe the blocker that you use and list the number of tuple pairs in the candidate set obtained after the blocking step.
List the number of tuple pairs in the sample G that you have labeled.
For each of the six learning methods provided in Magellan (Decision Tree, Random Forest, SVM, Naive Bayes, Logistic Regression, Linear Regression), report the precision, recall, and F-1 that you obtain when you perform cross validation for the first time for these methods on I.
Report which learning-based matcher you selected after that cross validation.
Report all debugging iterations and cross validation iterations that you performed. For each debugging iteration, report (a) the matcher you are trying to debug and its precision/recall/F-1, (b) the kinds of problems you found and what you did to fix them, and (c) the final precision/recall/F-1 that you reached. For each cross validation iteration, report (a) which matchers you were evaluating using cross validation, and (b) their precision/recall/F-1.
Report the final best matcher that you selected, and its precision/recall/F-1.
It is important to note that all precision/recall/F-1 numbers asked for above must be obtained via cross validation (CV) on the set I. Do not use set J yet.
Now report these numbers:
For each of the six learning methods, train the matcher based on that method on I, then report its precision/recall/F-1 on J.
For the final best matcher Y selected, train it on I, then report its precision/recall/F-1 on J.
Report approximate time estimates: (a) to do the blocking, (b) to label the data, (c) to find the best matcher.
Provide a discussion on why you didn't reach higher recall, and what you can do in the future to obtain higher recall.
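For reference, the precision/recall/F-1 numbers asked for above are computed as sketched below; the gold labels and predictions are invented for illustration, and in practice Magellan computes these metrics for you when you evaluate a matcher.

```python
def prf1(gold, pred):
    """Precision, recall, and F-1 for binary match labels (1 = match)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented example: 4 true matches; the matcher finds 3 of them
# and also flags 1 false positive.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 1, 0, 0, 0]
p, r, f = prf1(gold, pred)  # -> (0.75, 0.75, 0.75)
```

This also shows why "precision at least 90%, recall as high as possible" is a trade-off: dropping borderline predictions raises precision but turns true matches into false negatives, lowering recall.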
BONUS POINTS: provide comments on what is good and what is bad about Magellan; that is, as users, what else would you like to see in Magellan? Are there any features/capabilities that you would really like to see added? Any bugs? Depending on how detailed and helpful these comments are, you can get 1-10 bonus points (which will help with the final grade, not just with the project).
Jupyter notebook: recall that in the CODE directory you should have a Jupyter notebook. This is because once you are done with your task, you often must communicate what you did and what you found to fellow data scientists, the rest of the company, and the world. A good way to do so is to produce a Jupyter notebook, and this step asks you to practice that skill. Here are the instructions for creating the Jupyter notebook.
Helping a Research Lab in Silicon Valley
If you do not want to use the datasets you have collected in Stage 1, you can use the following dataset that is provided by a research lab in Silicon Valley for Project Stage 3. Specifically,
Download the dataset here.
The above link contains two files: (1) datasets.tar.gz and (2) readme. The file datasets.tar.gz contains three CSV files (movies.csv, tracks.csv, and songs.csv). The readme file gives a brief description of the dataset.
Try to match between movies and tracks (each track describes a song in a movie), or tracks and songs, or do both.
If you use this dataset, you can ask Pradap for help along the way, as he is motivated to see it matched. You would also be helping a research lab in the Valley, which is keen to get this matching done.
Misc Information
Check out MadDSI for tools that can help you. Please email us if you find tools that are helpful but not listed there.