Out Sun Apr 21, due Fri May 10 (individual steps have earlier due dates; please see below).
In this project stage, your team will use CloudMatcher to do blocking and matching for the two tables A and B that you created in Project Stage 2. Here are the steps. Please follow them very carefully.
Make sure that the tables A and B are in well-formed CSV format. See examples of such CSV tables under "The 784 Data Sets" on this page.
Identify your group number and know which CloudMatcher instance your group is supposed to use. I will send an email about this to the class mailing list.
Use that CloudMatcher instance to perform a trial run of matching two tables of restaurants: Fodors and Zagats. (You will have to submit the result of this matching.) I will email a link to the tutorial. Make sure that you follow the tutorial very closely.
Use that CloudMatcher instance to match the two tables A and B of your group.
Once you are done with Step 4, you will estimate the precision and recall of CloudMatcher on your dataset. Here are the instructions for how to do so, and here is the Jupyter notebook referenced in the instructions.
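For Step 1, a quick way to catch common CSV problems before uploading is to parse each table and confirm that every record has the same number of fields as the header. The following is only a minimal sketch using Python's standard csv module. The function name find_csv_problems is made up for illustration, and passing this check does not guarantee that CloudMatcher will accept the file.

```python
import csv
import io


def find_csv_problems(text):
    """Return a list of human-readable problems found in CSV text.

    Hypothetical helper for illustration; it checks only basic structure.
    """
    problems = []
    # newline="" lets the csv module handle line endings inside quoted fields.
    rows = list(csv.reader(io.StringIO(text, newline="")))
    if not rows:
        return ["file is empty"]

    header = rows[0]
    if any(name.strip() == "" for name in header):
        problems.append("header contains an empty column name")
    if len(set(header)) != len(header):
        problems.append("header contains duplicate column names")

    # Records are numbered from 1, with the header as record 1.
    for record_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"record {record_no}: expected {len(header)} fields, "
                f"found {len(row)}"
            )
    return problems
```

To check a file on disk, read it with open(path, newline="", encoding="utf-8") and pass the contents to the function. An empty result means no structural problem was found. A typical failure is an unquoted comma inside a value, which shows up as a record with too many fields.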
What to Submit?
On your team's project page, please submit the following:
By the end of Wed Apr 24, please list
the user ID that you used to log into the CloudMatcher cluster.
the project ID that you used to create a project in which you matched the two tables Fodors and Zagats.
a screenshot of the very last screen after you are done with the matches. Here is an example of such a screen.
Please list the above on the project page of your group, under the headline "Matching Fodors and Zagats".
Once your team is done with the above, you should proceed to use CloudMatcher to block and match the two tables A and B that you extracted in Project Stage 2. Here are the deadlines. Please read them carefully.
By the end of Fri Apr 26, 13 project teams will submit the results for the blocking step for their dataset (we describe the results below). I WILL EMAIL THE CLASS MAILING LIST ON WHICH TEAM SHOULD SUBMIT WHEN.
By the end of Sat Apr 27, the remaining 13 project teams will submit the results for the blocking step for their dataset (we describe the results below).
By the end of Wed Apr 30, 13 project teams will submit the results for the matching step for their datasets (we describe the results below).
By the end of Thur May 1st, the remaining 13 teams will submit the results for the matching step for their datasets (we describe the results below).
Here's how to submit:
TO SUBMIT THE RESULTS FOR THE BLOCKING STEP:
Your team should use CloudMatcher to do blocking on the two tables A and B that you have extracted in Project Stage 2. Once blocking is done, you should list the following:
the user ID that you used to log into CloudMatcher.
the project ID that you used to create a project in which you blocked and matched the two tables A and B.
a screenshot of the top ten blocking rules that CloudMatcher has learned. Here is an example of such a screen.
Please list the above on the project page of your group, under the headline "Blocking Results".
TO SUBMIT THE RESULTS FOR THE MATCHING STEP:
Your team should continue to use CloudMatcher to match the two tables A and B that you have extracted in Project Stage 2. Once matching is done, you should list the following:
the user ID that you used to log into CloudMatcher.
the project ID that you used to create a project in which you blocked and matched the two tables A and B.
a screenshot of the very last screen after you are done with the matches. Here is an example of such a screen.
Please list the above on the project page of your group, under the headline "Matching Results".
TO SUBMIT THE RESULTS FOR THE STEP OF ESTIMATING THE PRECISION AND RECALL:
Under the headline "Estimating accuracy" on your project homepage:
Provide a link to the files "Prediction list", "Candidate set", "Table A", and "Table B" that you have downloaded from CloudMatcher.
List the size of the candidate set C. If the size is less than or equal to 500, provide a link to the file L, which is the set C together with all the labels that you have manually provided. List the precision and recall that you compute using L and the method 'estimate_PR' (of the Jupyter notebook). Note: you can list this information directly on the Web page. There is no need to provide a PDF file.
If the size of the candidate set C is greater than 500, then state so, and then
provide a PDF file that discusses each iteration of computing the density of the matches and using new blocking rules to reduce the candidate set size. In the PDF file, if you have used any blocking rules, describe them in detail so that we can replicate your code.
provide a link to where your blocking code is.
provide a link to the final reduced set of candidate tuple pairs.
provide a link to the set of 400 tuple pairs that you have sampled and manually labeled.
list the precision and recall you have obtained using the above 400 tuple pairs and the Jupyter notebook.
You need to submit the above results for "Estimating accuracy" by 11:59pm Fri May 10. If you need an extension, please email AnHai.
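To make the accuracy estimate concrete, here is a rough sketch of drawing a random sample of tuple pairs and computing precision and recall from manual labels. This is not the 'estimate_PR' method from the Jupyter notebook, which you should still use for the numbers you report. The function names, the (a_id, b_id) pair representation, and the fixed seed are all assumptions made for illustration. Note also that recall computed this way is relative to the labeled pairs only, so matches that blocking has already removed are not counted.

```python
import random


def sample_pairs(candidate_pairs, k=400, seed=0):
    """Draw up to k distinct pairs uniformly at random (hypothetical helper).

    A fixed seed makes the sample reproducible.
    """
    pairs = list(candidate_pairs)
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))


def precision_recall(labels, predicted):
    """Point estimates of precision and recall on a labeled set of pairs.

    labels:    dict mapping (a_id, b_id) -> True if manually labeled a match.
    predicted: set of (a_id, b_id) pairs that the matcher predicted as matches.
    Returns (precision, recall); a value is None when its denominator is zero.
    """
    true_pos = sum(
        1 for pair, is_match in labels.items() if is_match and pair in predicted
    )
    n_predicted = sum(1 for pair in labels if pair in predicted)
    n_actual = sum(1 for is_match in labels.values() if is_match)

    precision = true_pos / n_predicted if n_predicted else None
    recall = true_pos / n_actual if n_actual else None
    return precision, recall
```

For example, if three of the labeled pairs are true matches and the matcher predicted three labeled pairs as matches, two of them correctly, then both precision and recall are 2/3. With only a few hundred labeled pairs these are estimates with some uncertainty, not exact values.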