Final projects are to be done in teams of 2. A 1-2 page PDF of your project proposal is due by April 23. We will send you feedback ~1 week later. Please note that your project proposal will factor into your overall project grade, so make sure that it is well written and follows the guidelines below.
Your project proposal must detail the data that you plan to use, how you will pre-process it, the questions you will ask, the machine learning algorithms you expect to use (these may change as the project develops), and what you expect to learn from your project. Most importantly, your proposal should demonstrate that you have gained familiarity with the data set.
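As a rough illustration of what "familiarity with the data set" might look like in practice, here is a minimal sketch that reports the row count, the number of missing values per column, and basic statistics for numeric columns. The toy data, the column names, and the `summarize` helper are all hypothetical and are not part of the course materials; substitute your own data set.

```python
import csv
import io
import statistics

# Hypothetical toy data standing in for a real data set (assumption, not course data).
SAMPLE_CSV = """age,income,label
34,52000,1
29,,0
41,61000,1
,48000,0
"""


def summarize(csv_text):
    """Return the row count plus per-column missing counts and numeric stats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {"n_rows": len(rows), "columns": {}}
    if not rows:
        return summary
    for col in rows[0].keys():
        values = [row[col] for row in rows]
        # Treat empty strings (and absent fields) as missing values.
        present = [v for v in values if v not in ("", None)]
        info = {"missing": len(values) - len(present)}
        try:
            nums = [float(v) for v in present]
        except ValueError:
            nums = []  # non-numeric column: report only the missing count
        if nums:
            info["min"] = min(nums)
            info["max"] = max(nums)
            info["mean"] = statistics.mean(nums)
        summary["columns"][col] = info
    return summary


if __name__ == "__main__":
    result = summarize(SAMPLE_CSV)
    print("rows:", result["n_rows"])
    for name, info in result["columns"].items():
        print(name, info)
```

A short table of this kind, together with a sentence or two about what the missing values and value ranges imply for pre-processing, is one way to show in the proposal that you have looked at the data.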
This project is meant to be open-ended. The goal is for you to spend time thinking deeply about machine learning. To give you an idea of the scope, we expect each person to spend ~40 hours on the project between now and the end of the semester.
Below are links to some publicly available data sets. We will add more links over the course of the semester. Your project may make use of these or other data sets. Projects based on data sets originating from PRC sources are highly desirable.
Face recognition, collaborative filtering, web ranking (see bottom, under "Projects")
See here for more collaborative filtering data
Blogs (with spam labels)
Enron e-mail data set (see also here)
ICPSR at the University of Michigan. ICPSR stands for Inter-university Consortium for Political and Social Research.
Data from the U.S. Census Bureau
Data from papers in the Journal of Applied Econometrics
NBER Macro History Database. See also their list of Business Cycle Dates.
PSID Panel Study of Income Dynamics
Knoema: Economic data about many countries.
Data from the U.S. National Institutes of Health (NIH): a portion of the data supplied by NIH
Chinese Data Sets:
The NYU Shanghai Center for Data Science has created a portal for Chinese data sets: