1.5 weeks, due 11:59 pm Wed Feb 8
In this project stage, your team will formulate a set of questions that you want to answer, select a few data sources, and then acquire data from those sources (this is discussed in the first few lectures of the class).
Requirements
You must select at least two data sources that contain structured data or from which structured data can be extracted. By structured data I mean data in a format such as relational, XML, JSON, CSV, etc. These two data sources must contain information about a set of overlapping entities, such as books, movies, cars, etc. This is because a later project stage involves entity matching: the two sources need to have overlapping entities so that we can match records across them and find data that refer to the same real-world entities.
Each of the above two sources should contain a reasonable amount of data, and the two sources should have a reasonable number of overlapping entities. For example, suppose we extract a relational table A from the first source where each tuple describes a person, and suppose we extract a similar table B from the second source. Then each table should have at least 3000 tuples, and they should share at least 100 persons (for this latter requirement, it is sufficient to eyeball the data).
You must select at least 300 text documents that contain some information that you want to extract (to answer the set of formulated questions). We will discuss this more in the class.
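As a rough aid for the size and overlap requirements above, here is a minimal Python sketch that counts the tuples in two CSV tables and estimates how many entities they share by exact match on a normalized name. The column name `name`, the helper names, and the sample rows are all hypothetical. Exact matching will usually undercount the true overlap (real entity matching comes in a later stage), so treat the result only as a quick sanity check alongside eyeballing the data.

```python
import csv
import io


def normalize(name):
    """Lowercase a name and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def load_names(csv_file, column="name"):
    """Read a CSV file object; return (row count, set of normalized names)."""
    rows = list(csv.DictReader(csv_file))
    names = {normalize(row[column]) for row in rows if row.get(column)}
    return len(rows), names


def check_requirements(file_a, file_b, min_tuples=3000, min_shared=100):
    """Report whether two tables meet the size and approximate overlap thresholds."""
    count_a, names_a = load_names(file_a)
    count_b, names_b = load_names(file_b)
    shared = names_a & names_b
    return {
        "tuples_a": count_a,
        "tuples_b": count_b,
        "shared": len(shared),
        "size_ok": count_a >= min_tuples and count_b >= min_tuples,
        "overlap_ok": len(shared) >= min_shared,
    }


# Tiny in-memory demo with lowered thresholds (made-up rows);
# real tables would be opened from files instead.
table_a = io.StringIO("name,age\nDave Smith,40\nAnn  Lee,31\nBob Ray,25\n")
table_b = io.StringIO("name,city\ndave smith,Madison\nann lee,Chicago\nCy Young,Boston\n")
report = check_requirements(table_a, table_b, min_tuples=3, min_shared=2)
print(report)
```

With real data, you would call `check_requirements(open("A.csv", newline=""), open("B.csv", newline=""))` and keep the default thresholds of 3000 tuples and 100 shared entities.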
What to Submit?
On your team's project page, by the deadline:
Submit a link to a pdf file that describes the following:
the set of questions that you want to answer (these may be somewhat vague right now),
the set of data sources that you have selected. Recall that you are supposed to select at least two data sources that will give you structured data, and at least 300 text documents.
a description of how you have extracted structured data from the two data sources.
what you want to extract from the text documents.
the names of open-source tools you have used in this project stage and a brief description of what they do.
Submit links to where we can find the structured data that you have extracted from the two sources, and where we can find the text documents. Do not zip this data, as that would require us to download and unzip it. Instead, submit the data in a browsable format.
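If your extracted data ends up as JSON records, one possible way to make it easier to browse is to flatten it into a CSV table. The following is only a sketch under stated assumptions: the function name `json_records_to_csv` and the sample records are hypothetical, and it assumes a JSON array of flat (non-nested) objects. Records that lack a field get an empty cell.

```python
import csv
import io
import json


def json_records_to_csv(json_text, out_file):
    """Write a JSON array of flat objects to out_file as CSV; return the header used."""
    records = json.loads(json_text)
    # Union of keys in first-seen order, so records with missing fields still fit.
    fields = []
    for record in records:
        for key in record:
            if key not in fields:
                fields.append(key)
    writer = csv.DictWriter(out_file, fieldnames=fields, restval="")
    writer.writeheader()
    writer.writerows(records)
    return fields


# Made-up sample records, for illustration only.
sample = '[{"title": "Dune", "year": 1965}, {"title": "Emma", "author": "Austen"}]'
buffer = io.StringIO()
header = json_records_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("table.csv", "w", newline="")` as `out_file`.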
Where to Find Data?
Data can be found in various places. We will discuss this more in class.
Some random examples:
TBD.
Misc Information
Check out MadDSI, especially the links on data acquisition and on data conversion and transformation, for tools that can help you. Please email us if you find tools that can help and are not listed here.