Change the World Project

One of the things people cited most consistently in The Value of Data assignment was a desire to use data science as a means of achieving positive change in the world.  In this project you will try to accomplish exactly this.

Learning Outcomes and Rationale for Selection

The first two projects in this course have been scaffolded in several ways: many of your classmates are working on the same datasets, the goal (to achieve the highest performance possible) is already defined by virtue of working on a Kaggle competition, and there are quite a few examples to draw from.

For this one, you will come up with a project of your own design.  You will not only select a dataset (or combination of datasets) but also craft your own questions, clean the data yourself, and work toward a goal without a well-worn path to success.

These changes will help you continue to build your skills in exploratory data analysis, data visualization, and iteration while bringing in meaningful real-world context.

Potential Projects

You have a lot of latitude in this project.  Here are a few basic ground rules, and then I'll elaborate on some specific suggestions.

Data Visualization

Data visualization can serve a wide variety of ends.  In the work we've done so far, it has mostly been useful for exploring data with an eye toward building a successful predictive model.  In journalism, it can be used to create infographics that enrich the story of an article.  It can also be used to make a larger political or social statement.  For instance, here is a visualization of the number of years of life lost due to gun violence in the US.

This makes an immediate visual and emotional impact while allowing for a more personalized story when hovering over one of the strands.  For a more nuanced take on this visualization and others like it, check out this article with Alberto Cairo.

Pick an issue or cause that you care about.  What do you want others to understand about this issue?  How might you accomplish this with a carefully crafted data visualization?

Develop a Data-Powered Tool

You might also consider creating a data-powered tool for a particular user group.  A lot of the Data Science for Social Good projects have this flavor.  Here are some examples of what has been done with the open data sets on data.gov.  If you want to do something more local, I have a dataset of Olin College course registrations (and some code for cleaning the data) that I can give you.

It's possible that you might want to work towards the development of a tool, but developing the tool in its entirety would be too much to do in the context of this relatively short project.  That is fine!  You should, however, be prepared to justify how your project helps take a step in the direction of creating an important tool for a particular group.

The form that this data-powered tool can take is very flexible.  It could be a predictive model (similar to Kaggle) or it could be a tool for exploratory data visualization to help a particular user understand data and extract insight.  Be creative!

Here are some other sources for project ideas (some of these may be overscoped for a 2.5 week project):

Data Sources

This is the most comprehensive list I've found so far.  You're also free to scrape your own dataset, combine multiple existing datasets, or collect your own dataset (this would have to be carefully considered so that it doesn't take too much time).
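If you do combine datasets, the core step is usually a join on a shared key.  Here is a minimal sketch of that idea using only the Python standard library.  The county names, column names, and numbers are made up for illustration, and it assumes the key is unique in the second table.  In practice, pandas `DataFrame.merge` covers the same ground.

```python
import csv
import io

# Two tiny made-up tables standing in for real datasets (hypothetical values).
POPULATION_CSV = """county,population
Adams,1000
Baker,2500
Clark,400
"""

CLINICS_CSV = """county,clinics
Adams,2
Baker,5
Dover,1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def inner_join(left, right, key):
    """Keep only rows whose key appears in both tables, merging their columns.

    Assumes each key value appears at most once in `right`.
    """
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


def unmatched_keys(left, right, key):
    """List keys present in only one table, so dropped rows are not a surprise."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return sorted(left_keys ^ right_keys)


population = read_rows(POPULATION_CSV)
clinics = read_rows(CLINICS_CSV)
combined = inner_join(population, clinics, "county")

for row in combined:
    # CSV values are read as strings, so convert before doing arithmetic.
    row["people_per_clinic"] = int(row["population"]) / int(row["clinics"])

dropped = unmatched_keys(population, clinics, "county")
```

Checking which keys failed to match (here, `dropped`) before trusting the combined table is worth the extra few lines: mismatched spellings or missing rows are a common reason a merged dataset quietly shrinks.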

Teaming

You should work with one other student for this project.  We have an odd number of students, so we will have one group of three.

Getting Started

Create a fork of this repo.  You will be issuing a pull request to turn in your work, so please make a fork of my repo rather than creating your own from scratch.

Deliverables

There are three deliverables for the project.

Project Proposal

You should check in a document to your Github repository that addresses the following topics.

Summary of the Project

Workflow and Schedule

Assessment

This is due Friday, February 26th BEFORE 9am.  I will not have time to discuss these with you in class.  I will be reading them the morning before class; therefore, you must have your proposal pushed to Github before 9am on the 26th.

Mid-Project Checkin

In your project proposal you set a schedule for the first week and a half.  Create a document called mid_project_checkin.md in your Github repository that addresses the following points.

This is due Friday, March 4th.

Final Output and Reflection

The form of your final output will depend largely on the nature of your project.  Whatever form your project takes, make sure that it is very clear how to find it when I look at your Github repository.  If your project is too sensitive to be on Github, we will make other arrangements.

In addition to the final project output, you will also (as a team) be doing a reflection on the project.  Your final reflection should cover the following topics:

Assessment evidence and interpretation: When you turned in your project proposal, you detailed an assessment plan for your project.  In your final reflection, please provide any evidence that you think will be helpful in evaluating your project.  If the only relevant piece of evidence is your final output, that is totally fine.  However, if there are other pieces of evidence (visualization mockups, IPython notebooks from models you didn't wind up using, data explorations, etc.), make sure to point those out.  In addition to listing these sources of evidence, please provide a brief interpretation of how they help inform the assessment of your project.

Changing the world: Do you think your project has the potential to change the world?  If not, why?  If so, what are the next steps to make this happen?

Learning goals: Did you learn the things that you wanted in this project?  If not, why?  If so, why do you think you were successful?

This is due Friday, March 11th.

Turning in Your Assignment

To turn in your assignment, all you need to do is issue a pull request from your DataScience16CTW repo to my upstream repo.  For more information on how to issue a pull request, see this page.  The pull request will be used as a way for me to communicate feedback on your assignment.  Remember, even if your final output is not hosted on Github, you should make it easy to find the final output through a prominent link in the README.md file in your Github repo.

Assessment

Mid-Project Checkin (20%): The assessment of this component will be fairly coarse-grained.  I am looking for evidence of an honest amount of work during the first half of the project, along with a thoughtful reflection on the progress of the project so far.

Final Output and Reflection (80%): I will be assessing your final output based on the assessment plan you crafted in your project proposal.  If this assessment plan needs to be modified, make sure that you update your project proposal accordingly.  I will also be assessing your reflection on this project; however, the details of this have not yet been finalized.