Two weeks, due 11:59 pm Tue Nov 8 (NOW DUE SUN NOV 13)
In this project stage you will write code to do blocking; this takes the two tables A and B and outputs a table C of tuple pairs judged likely to match. Table C must be in the format described at the end of this page. The above code will most likely use a set of rules. We will talk more about this in the class.
Note: blocking is NOT the same as matching. The goal of blocking is just to get rid of as many tuple pairs judged obviously not matched as possible. For example, given two tables A and B describing people, each having 1000 tuples, I don't want to consider matching all 1M tuple pairs in the Cartesian product. So I may want to use some blocking heuristics to reduce the number of pairs I need to consider. One such blocking heuristic would be: only consider tuple pairs that agree on the zipcode. Using this heuristic, I may be able to reduce the number of tuple pairs from 1M to just 150K. Then I can use other blocking heuristics to reduce this number further. Eventually, once I'm done with blocking, let's say I have 48K tuple pairs left. Then I will now consider matching these pairs, i.e., finding out if each pair match or not.
Deliverables for this step
+ code to do blocking: you have to submit a Jupyter file (formerly known as an IPython file) that shows the commands that you use to load in the two tables A and B (obtained in the previous project stage), commands that perform blocking on tables A and B, and commands that save the output table C to disk. If you write external Python programs to do blocking, then submit the code for those programs as well.
Note: you can use a package that we have developed called Magellan to perform blocking. To learn more about Magellan, see this page.
+ table C, the output of the blocking step. This table must be in csv format (as described at the end of this note). If you use Magellan to do blocking on tables A and B, which produces a table C in memory, then you can save this table to disk using the appropriate Magellan command. The table will be saved in the required CSV format.
+ a file called blocking-explanation.pdf that explains in plain English how you performed blocking. Specifically, you have to answer at least the following questions:
How did you develop the final blocker? What blocker did you start with? What problems did you see? Then how did you revise it to come up with the next blocker? In short, explain the *development process*, from the first blocker all the way to the final blocker (that you submit in the Jupyter file).
If you use Magellan, then did you use the debugger? If so, where in the process? And what did you find? Was it useful, in what way? If you do not use Magellan, you can skip this question.
How much time did it take for you to do the whole blocking process?
Report the size of table A, the size of table B, the total number of tuple pairs in the Catersian product of A and B, and the total number of tuple pairs in the table C.
Did you have to do any cleaning or additional information extraction on tables A and B?
Did you run into any issues using Magellan (such as scalability?). Provide feedback on Magellan. Is there anything you want to see in Magellan (and is not there)? If you do not use Magellan, you can skip this question.
Any other feedback is appreciated.
https://sites.google.com/site/anhaidgroup/courses/cs-638-fall-2016/project/stage-3Python Packages You May Want to Use
Regardless of whether you will use Magellan or write your own Python code, you may need to perform string matching (e.g., using edit distance, Jaccard, etc.). We have developed two easy-to-use Python packages for this purpose and we would highly recommend that you consider using them if necessary. The two packages are:
py_stringmatching: this package contains a large set of string similarity measures, such as edit distance, Jaccard, etc.
py_stringsimjoin: this package allows you to match two sets of strings (using a measure such as edit distance etc) quite efficiently.
If you have any questions using them, please contact your mentor (and then me).
Required Format for Table C
If you use Magellan
As mentioned above, if you use Magellan to do blocking, say applying some blockers to tables A and B to produce a table C, then use the appropriate Magellan command to save table C to disk. Given a filename, the command will save two files to disk: (1) contents of table C in a file with the '.csv' extension, and (2) metadata of table C in a file with '.metadata' extension.
In particular, the saved file with '.csv' extension will look like this:
_id,ltable_ID,rtable_ID,ltable_name,rtable_name
0,a1,b1,Kevin Smith,Mark Levene
1,a1,b2,Kevin Smith,Bill Bridge
Essentially, the saved file includes table contents in CSV format. The file with '.metadata' extension will look like this:
#key=_id
#ltable=POINTER
#rtable=POINTER
#fk_ltable=ltable_ID
#fk_rtable=rtable_ID
Note: You don't have to worry about the file with .metadata extension (though you should save it somewhere). The file with .csv extension is already in the format required for this project stage. You can submit it.
If you write you own blocking code
If you do blocking outside Magellan, then table C will be created outside Magellan. In this case, you are responsible for making sure that table C conforms to the following required format.
First, the table C must be in CSV format and it must contain at least three attributes:
an attribute named "ID" that uniquely identifies the tuple pair in C. That is, it will be the key attribute of table C. You will have to generate this (e.g., by keeping a counter and increasing the counter by 1 for each tuple pair, then use the counter value as the ID value for the tuple pair).
an attribute name "ltable_id", where the suffix 'id' is the key attribute of table A. For example, if the key attribute of table A is "ssn", then this attribute name will be "ltable_ssn" (without the quotes, of course).
an attribute name "rtable_id" where the suffix 'id' is the key attribute of table B.
Let's say that table C contains a tuple pair (x,y) where tuple x comes from Table A and has the value 15 for its id attribute, and tuple y comes from table B and has the value 26 for its id attribute. Then the tuple pair (x,y) can be represented in Table C using a tuple like this:
18, 15, 26
where 18 is the id of this tuple pair in Table C (recall that this id value is generated by you), 15 is the id of tuple x in Table A, and 26 is the id of tuple y in Table B. Note that with the id values 15 and 26, we can completely look up x and y (given Tables A and B).
To create the above table, first create a header line describing the above three attributes. That is, the header line should look like this:
ID, ltable_id, rtable_id
where 'id' should be replaced with the proper attribute name (as discussed above). Next, generate the tuples for Table C, one tuple per row, in a way that conforms to typical CSV format (e.g., the values in the tuple are separated by commas, etc.). The same guidelines (to specify data in CSV format) mentioned for table A and B applies to table C.
Example: Table C may look like this
ID,ltable_ID,rtable_ID
0,a1,b1
1,a1,b2
2,a1,b6
3,a3,b1
Here, "ID" is the name of the key attribute for Table A, and is also the name of the key attribute for Table B.
You can add more attributes to the tuples of Table C. If you add an attribute from Table A, you must prefix its name with 'ltable_', and if you add an attribute from Table B, prefix its name with 'rtable_'.
Example: If we decide to include all attributes from Tables A and B in Table C, then Table C may look like this:
ID,ltable_ID,rtable_ID,ltable_name,ltable_hourly_wage,ltable_zipcode,rtable_name,rtable_hourly_wage,rtable_zipcode
0,a1,b1,Kevin Smith,30.0,94107,Mark Levene,29.5,94107
1,a1,b2,Kevin Smith,30.0,94107,Bill Bridge,32.0,94107
2,a1,b6,Kevin Smith,30.0,94107,Michael Brodie,32.5,94107
3,a3,b1,William Bridge,32.0,94107,Mark Levene,29.5,94107
==============================================================================================