Introduction
The goal of this project is to get your hands "dirty" as a data scientist (and to practice material taught in the class). After finishing the project, you will gain a much better appreciation for working with "data in the wild", a better understanding of what it means to work as a data scientist, a deeper understanding of the class material, a chance to work with popular data science tools in Python, and a glimpse into some research efforts in data science.
Specifically, in this project you will select two Web sources, crawl them to retrieve HTML data, and perform data extraction to convert the HTML data into two relational tables (describing entities such as persons, products, books, movies, papers, etc.). Next, you will explore and clean the tables, then use Magellan, a data matching system developed here at UW-Madison, to match the two tables (i.e., find tuples that refer to the same real-world entity). You will then use these matches to integrate the two tables into a single unified database. Finally, you will perform data analysis on this database. (Some of these stages may be omitted if we run out of time.)
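To make the crawl-and-extract step concrete, below is a minimal Python sketch that downloads a few listing pages and writes one relational table as a CSV file. The URL and the CSS selectors (div.book, h2.title, span.author, span.price) are hypothetical placeholders; adapt them to the sites you actually choose, and repeat the process for your second source.

    import csv
    import requests
    from bs4 import BeautifulSoup

    URL = "https://example.com/books?page={}"       # placeholder listing URL

    rows = []
    for page in range(1, 6):                        # crawl a handful of listing pages
        html = requests.get(URL.format(page), timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        for item in soup.select("div.book"):        # hypothetical CSS selector
            rows.append({
                "title": item.select_one("h2.title").get_text(strip=True),
                "author": item.select_one("span.author").get_text(strip=True),
                "price": item.select_one("span.price").get_text(strip=True),
            })

    # Write one relational table per source (e.g., tableA.csv and tableB.csv).
    with open("tableA.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "author", "price"])
        writer.writeheader()
        writer.writerows(rows)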
Stages
All of the dates below are subject to change. All deadlines are at 11:59pm on the dates listed.
Stage 0: form a team, two weeks, due Tue Sept 20. You will enter your team information into a page that we will provide.
Stage 1: crawl and extract to obtain two tables, three weeks, due Tue Oct 11
Stage 2: explore, understand, clean, transform the two tables, two weeks, due Tue Oct 25
Stage 3: blocking, two weeks, due Tue Nov 8 (NOW DUE SUN NOV 13)
Stage 4: sample, label, find the best matcher, and apply it (see the Magellan sketch after this schedule), three weeks, due Tue Nov 29 (NOW DUE THUR DEC 1)
Stage 5: merge the data and perform some analysis, two weeks, due Thur Dec 15
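As a preview of Stages 3 and 4, the sketch below shows a typical Magellan (py_entitymatching) workflow: block the two tables, sample and label candidate pairs, train a matcher, and apply it to held-out pairs. The attribute names (title, year) and the key column id are assumptions; substitute the schema of your own tables.

    import py_entitymatching as em

    # Load the two tables produced in Stages 1-2; 'id' is an assumed key column.
    A = em.read_csv_metadata("tableA.csv", key="id")
    B = em.read_csv_metadata("tableB.csv", key="id")

    # Stage 3: blocking -- keep only pairs whose titles share at least 2 words.
    ob = em.OverlapBlocker()
    C = ob.block_tables(A, B, "title", "title",
                        word_level=True, overlap_size=2,
                        l_output_attrs=["title", "year"],
                        r_output_attrs=["title", "year"])

    # Stage 4: sample candidate pairs and label them (opens a labeling GUI).
    S = em.sample_table(C, 400)
    G = em.label_table(S, "gold")

    # Split the labeled data, generate feature vectors, and train a matcher.
    split = em.split_train_test(G, train_proportion=0.7)
    F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
    H = em.extract_feature_vecs(split["train"], feature_table=F, attrs_after="gold")
    dt = em.DTMatcher()
    dt.fit(table=H,
           exclude_attrs=["_id", "ltable_id", "rtable_id", "gold"],
           target_attr="gold")

    # Apply the trained matcher to the held-out pairs.
    L = em.extract_feature_vecs(split["test"], feature_table=F, attrs_after="gold")
    P = dt.predict(table=L,
                   exclude_attrs=["_id", "ltable_id", "rtable_id", "gold"],
                   target_attr="predicted", append=True, inplace=False)

This is only one reasonable configuration (an overlap blocker plus a decision-tree matcher); in Stage 4 you will compare several matchers and pick the one that performs best on your labeled data.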