Warmup Project

For your first project you will be working (individually) on the Titanic Kaggle competition.  When I say individually, I don't mean that you should do this assignment in isolation!  What I mean is that you will be turning in your own code and written deliverables; however, you are free to share ideas and to offer and receive assistance from your peers.  In this project you will start out by exploring the data using some data science power tools (pandas, scikit-learn, Jupyter notebook, etc.).  Next, you will go through a guided tutorial of one specific approach to the competition.  You will then use this approach as a jumping-off point to try your own ideas.  Finally, you will use the vast pool of data science knowledge on the web to create a second version of your model.  The whole thing wraps up with an in-class, small-group discussion of what you found and the lessons you learned.

A video of the Titanic wreckage.

Learning Outcomes and Rationale for Selection

I have carefully selected this as our first project for the following reasons:

The learning outcomes that I am aiming for with this project are:

What This Assignment is Not About

I feel compelled to preemptively bring up a potential pitfall of using Kaggle competitions to introduce data science.  I want to be crystal clear that the point of this assignment is to learn as much as possible.  The point of this assignment is not to achieve the highest score on the Kaggle leaderboard.  For instance, if you copied some code you didn't understand and used it to get a very high score on the leaderboard, I would consider that a very poor outcome for this assignment.  However, if you thoughtfully try various strategies, incorporate lots of interesting learning resources, and do a good job communicating with your peers regarding what worked and what didn't, I would consider that highly successful even if your score on the leaderboard was not that high.

Deliverables / Assignments

Getting started

First, make a fork of the DataScience16WarmupProject base repository.  Once you have your repository set up, you should make an account at kaggle.com.  Next, navigate to the Titanic dataset and download both the training and testing data.  You will want to commit these files to your repository.
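Once the files are downloaded, a quick sanity check can confirm that everything loads as expected.  The sketch below assumes you saved the files as train.csv and test.csv in the root of your repository (adjust the paths if you put them somewhere else).

```python
import pandas as pd

# Load the two CSV files downloaded from Kaggle (paths are an assumption;
# change them to match where you committed the data in your repo).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# The training set should include the Survived column; the test set should not.
print(train.shape, test.shape)
print(train.columns.tolist())
```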

Data Exploration

Create a Jupyter notebook that explores some basic properties of the data.  Here are some suggestions:

Remember that a Jupyter notebook is a form of literate programming.  Make sure to include lots of motivation and interpretation in the form of Markdown both before and after the code.  Some more advanced ideas for exploration, along with how to implement some of the basics, can be found in this Kaggle script.  However, I wouldn't necessarily jump right to it, as you may miss out on some learning if you copy too much from the example.
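As a concrete starting point, here is a minimal sketch of the kind of exploration you might begin with.  It assumes the training data is saved as train.csv; the particular columns and groupings are just examples, not a required checklist.

```python
import pandas as pd

train = pd.read_csv("train.csv")  # assumed path to the Kaggle training data

# Summary statistics for the numeric columns (Age, Fare, etc.)
print(train.describe())

# How much data is missing in each column?
print(train.isnull().sum())

# Survival rate broken down by sex and by passenger class
print(train.groupby("Sex")["Survived"].mean())
print(train.groupby("Pclass")["Survived"].mean())
```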

Push your notebook to your Github repo as DataScience16WarmupProject/data_exploration.ipynb.

Model Iteration 1

Complete the DataQuest mission Getting Started with Kaggle (please note that the website forces you to use Python 3.  Here is a complete cheat sheet, but mainly be aware that print should be called just like any other function).
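As a quick illustration of the Python 3 difference mentioned above:

```python
# In Python 3, print is an ordinary function, so it needs parentheses.
print("Number of passengers:", 891)

# The Python 2 statement form would be a syntax error in Python 3:
# print "Number of passengers:", 891
```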

Troubleshooting:  I have found that DataQuest's mechanism for telling whether or not you got the correct answer is sometimes brittle.  If you know too much, you may get yourself into trouble :).  If you are confident that your solution is correct even though you are not being allowed to continue, it may be because the data type of your series is wrong.  If you cannot understand why your answer keeps getting marked wrong, consider changing the type of the series to "object" using .astype(object).
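For example, a Sex column that you have converted to numeric codes will often end up with an integer dtype, even though the values themselves are what the checker expects.  The sketch below shows the workaround; the column and mapping are just an illustration.

```python
import pandas as pd

titanic = pd.read_csv("train.csv")  # assumed path to the Kaggle training data

# After mapping Sex to numeric codes, the column's dtype is typically int64.
titanic["Sex"] = titanic["Sex"].map({"male": 0, "female": 1})
print(titanic["Sex"].dtype)

# Coercing the series back to "object" can satisfy the DataQuest checker.
titanic["Sex"] = titanic["Sex"].astype(object)
print(titanic["Sex"].dtype)
```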

Next, you should create a Jupyter notebook called model_iteration_1.ipynb that repeats the steps from the DataQuest mission and generates the appropriate submission file.  Make sure to include Markdown that explains the basic steps that your code is performing.  Upload the submission file to Kaggle.  Add the accuracy score Kaggle reports to your Jupyter notebook.
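The DataQuest mission walks you through its own modeling steps, so treat the sketch below only as an illustration of the overall shape of a submission pipeline and of the two-column (PassengerId, Survived) CSV format that Kaggle expects.  The cleaning and model choices here (median imputation, logistic regression on a handful of columns) are assumptions for illustration, not the mission's exact recipe.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

predictors = ["Pclass", "Sex", "Age", "Fare"]

# Minimal cleaning: fill missing numeric values and encode Sex as 0/1.
for df in (train, test):
    df["Age"] = df["Age"].fillna(train["Age"].median())
    df["Fare"] = df["Fare"].fillna(train["Fare"].median())
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

model = LogisticRegression(max_iter=1000)
model.fit(train[predictors], train["Survived"])
predictions = model.predict(test[predictors])

# Kaggle expects exactly two columns: PassengerId and Survived.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})
submission.to_csv("submission.csv", index=False)
```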

Next, try to improve on the basic model from the DataQuest module.  At the bottom of your model_iteration_1 notebook, create a new Markdown cell.  In the Markdown cell, explain a revision to the model that you want to try.  This revision could be based on your intuition or on the results of your explorations of the data from the previous part.  Once you have described the revision and the motivation for it, create a new code cell that implements your revised model, use the predictions of the new model to create a submission file, and upload it to Kaggle.  Annotate your notebook with the new score and briefly describe what you learned about your model based on the score.  You should try at least two revisions to the DataQuest model.
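If you want a quicker signal than a leaderboard upload for each idea, a cross-validation score on the training data is one option.  The sketch below compares a baseline feature set against a hypothetical FamilySize feature; the feature names and cleaning steps are illustrative assumptions, not a prescribed revision.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Hypothetical revision: account for passengers traveling alone vs. in families.
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1

baseline = ["Pclass", "Sex", "Age", "Fare"]
revised = baseline + ["FamilySize"]

for name, predictors in [("baseline", baseline), ("baseline + FamilySize", revised)]:
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             train[predictors], train["Survived"], cv=5)
    print(name, round(scores.mean(), 3))
```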

When you are done, push your notebook to Github as DataScience16WarmupProject/model_iteration_1.ipynb.

Model Iteration 2

Now that you've gotten your feet wet, you are free to explore!  The goal of this next part is learning how to learn data science.  In this part it is not enough to simply try some new variation of your model; instead, I want you to revise your model based on the ideas of another data scientist.  Any online resource is fair game, but here are some great places to start:

Create a notebook called model_iteration_2.ipynb that describes:

Push your code to Github as DataScience16WarmupProject/model_iteration_2.ipynb.

In-Class Sharing and Presentation

You should come to class ready to engage in a class-wide discussion.  Prompts for discussion include:

We will spend about twenty-five minutes discussing this as a class.

Turning in Your Assignment

In order to turn in your assignment all you need to do is issue a pull request from your DataScience16WarmupProject repo to my upstream repo.  For more information on how to issue a pull request, see this page.  The pull request will be used as a way for the course staff to communicate feedback on your assignment.

Assessment

Your grade for this assignment will be based on your three notebooks.  The notebooks (as a whole) will be graded on four components: functionality (40%), code documentation (15%), code style (15%), and writing quality (30%) according to this rubric.