Three weeks, due 11:59 pm Tue Oct 11
In this project stage your team will do the following:
Select two Web sites that list data that you can convert later into two relational tables. Examples of such data include products, employees, researchers, papers, movies, music albums, etc. Examples of such Web sites include amazon.com, walmart.com, DBLP, Google Scholar, IMDB, etc. For example, you can check out the Web sites that students in a former data science class retrieved data from (see the section "The 784 Data Sets").
Note 1: It would be great if you can find new Web sites. But if you are stuck, you can retrieve data from the above Web sites too. Just make sure that you do not simply reuse data downloaded by previous classes. Such attempts will be detected and your team will receive 0 for the entire project.
Note 2: You want to find Web sites where the data that you will download can be matched. For example, if you download movies from two movie Web sites, you want the movies to have enough details so that at least you (as a human user) can look at two movies and decide whether they refer to the same real-world movie. If you cannot even do this, then there is no hope that matching algorithms will be able to do it. We will discuss this more in the class.
Note 3: Related to Note 2, you want the data downloaded from the two Web sites to share a reasonable number of matches. For example, if you download products from Amazon and Walmart, you want the two sets of products to share at least a few hundred products. We discussed why and how to do this in class.
Write two crawlers to crawl the Web sites to obtain HTML data (this data will be made public later, so do not select Web sites with sensitive data). From each Web site you will obtain a set of HTML pages. Crawlers are also often called spiders in the literature.
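As a starting point, a crawler can be as simple as a loop that downloads listing pages and saves the raw HTML. Below is a minimal sketch using the requests package; the URL pattern and page count are hypothetical placeholders that you would replace with those of the sites your team picks.

```python
# Minimal crawler sketch. BASE_URL is a made-up placeholder --
# substitute the listing-page URL pattern of your chosen Web site.
import time
import requests

BASE_URL = "https://example.com/products?page={}"  # hypothetical

def crawl_pages(num_pages, delay=1.0):
    """Download num_pages listing pages and return their raw HTML."""
    pages = []
    for i in range(1, num_pages + 1):
        resp = requests.get(BASE_URL.format(i), timeout=30)
        if resp.status_code == 200:
            pages.append(resp.text)
        time.sleep(delay)  # be polite: pause between requests
    return pages
```

In practice you would also save each page to a file in the per-site HTML directory required by the deliverables, and respect the site's robots.txt and rate limits.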
Decide which attributes you will extract from HTML pages, then write scripts to extract those attributes. At the end of this step, you should have converted HTML data from each Web site into a relational table. We will refer to these two tables as A and B. Each table must have at least 3,000 tuples. The tables must be in the format listed at the end of this page. Further, the first attribute of each tuple must be an ID attribute. That is, each tuple in a table must come with a unique ID value. If no ID attribute can be found in the HTML pages, then make up ID values.
Note 1: When deciding which attributes to extract, make sure to extract enough attributes so that later you can reliably match the tuples across tables A and B.
Note 2: Try to extract the same set of attributes for both tables. So both tables should have the same schema. (It is okay if for this step you do not extract the same set of attributes for both tables. In this case, in the matching step much later, you will have to transform the two tables to have the same schema.)
Note 3: If an attribute does not have a value in a Web page, then assign it a missing value.
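The extraction step above can be sketched with Beautiful Soup. The HTML structure and CSS selectors below are made up for illustration; inspect your own pages to find the right ones. Note how a missing attribute becomes an empty field, and how the tuple carries an ID value, per the notes above.

```python
# Extraction sketch with Beautiful Soup. The selectors
# (".product-name", ".price", ".brand") are hypothetical.
from bs4 import BeautifulSoup

def extract_tuple(html, tuple_id):
    """Turn one HTML page into a tuple matching the table schema."""
    soup = BeautifulSoup(html, "html.parser")

    def text_or_empty(selector):
        node = soup.select_one(selector)
        # Missing attribute -> empty field, per the csv guidelines
        return node.get_text(strip=True) if node else ""

    return {
        "ID": tuple_id,  # made-up ID value if the page has none
        "name": text_or_empty(".product-name"),
        "price": text_or_empty(".price"),
        "brand": text_or_empty(".brand"),
    }
```

Running this function over all saved HTML pages of one site, and writing the resulting dictionaries out with the csv module, yields one of the two required tables.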
Misc issues
Some Web sites will allow you to retrieve data as a big dump, say in XML format. You are supposed to crawl, retrieve HTML data, and then write scripts to extract structured data from it; this is to help you exercise certain data science skills. So please ignore XML data or any data in a big "dump" file, if any is available.
Recommended tools: consider the Python packages requests, scrapy, and Beautiful Soup.
There is plenty of information on the Web about how to use these packages. But if you need more information about scrapy, Beautiful Soup, or pulling information out of Web pages, check out this book.
Try to use tools in the PyData ecosystem (such as scrapy). But if you are far more comfortable using tools in other languages, feel free to do so. Just keep in mind that when you go to work in industry, knowing how to use certain tools in the PyData ecosystem gives you certain advantages. Another issue is that if you use non-Python tools, we may not be able to assist with problems you run into.
What happens if an attribute has multiple values, such as "authors"? How can you represent this in csv format? We will discuss this in class.
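One common convention (an assumption for illustration only; use whatever is agreed on in class) is to join the multiple values into a single field and let the csv module quote it when needed:

```python
# One possible way to store a multi-valued attribute such as
# "authors" in one csv field. This is just one convention --
# the in-class discussion will settle what your team should use.
import csv
import io

row = {"ID": "a1", "title": "A Paper", "authors": ["J. Smith", "K. Lee"]}
buf = io.StringIO()
writer = csv.writer(buf)
# Joining with ", " puts a comma inside the field, so csv.writer
# double-quotes it automatically, per the csv guidelines below.
writer.writerow([row["ID"], row["title"], ", ".join(row["authors"])])
print(buf.getvalue())  # a1,A Paper,"J. Smith, K. Lee"
```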
Deliverables for this stage
Your team will set up a Web page that provides links to HTML data and the two tables.
HTML data of each Web site must be listed within a directory for that Web site. So if you set up your team's page on Google Sites, you may still need to find a place somewhere else to store the HTML data.
The two tables are stored in two files named tableA.csv and tableB.csv.
By the deadline we expect that your team's homepage will have links to both HTML data and the two csv tables.
More examples
Here is an example of deliverables from a previous class project. (Note: this link may go down at any time.) Look at the information below "Deliverable-1". No need for anything fancy. Links to the required HTML and csv data are fine.
Required format for tableA.csv and tableB.csv
Tables A and B must be in the .csv format with a mandatory header line representing the attribute names. The header line is of the form:
attribute_name1, attribute_name2, ….
Below the header line follow the tuples. Each tuple is a comma-separated list of the attribute values.
Some guidelines to specify data in csv format are given below:
Each line must contain the same number of comma-separated fields.
Fields containing a comma must be double-quoted.
A missing value is represented by an empty field.
Each tuple must have an ID attribute (this should be the first attribute of the table).
Below is a sample table in csv format:
ID,name,birth_year,hourly_wage,address,zipcode
a1,Kevin Smith,1989,30,"607 From St, San Francisco",94107
a2,Michael Franklin,1988,27.5,"1652 Stockton St, San Francisco",94122
a3,William Bridge,1986,32,"3131 Webster St, San Francisco",94107
a4,Binto George,1987,32.5,"423 Powell St, San Francisco",94122
a5,Alphonse Kemper,1984,,"1702 Post Street, San Francisco",94122
Note that for ID a5, the hourly_wage value is missing.
More examples of tables in csv format can be found in the previous class's project.