Warmup Project
For your first project you will be working (individually) on the Titanic Kaggle competition. When I say individually, I don't mean that you should do this assignment in isolation! What I mean is that you will be turning in your own code and written deliverables; however, you are free to share ideas and to offer and receive assistance from your peers. In this project you will start out by exploring the data using some data science power tools (pandas, scikit-learn, Jupyter notebook, etc.). Next, you will go through a guided tutorial of one specific approach to the competition. You will then use this approach as a jumping-off point for trying your own ideas. Finally, you will use the vast pool of data science knowledge on the web to create a second version of your model. The whole thing wraps up with an in-class, small-group discussion of what you found and the lessons you learned.
[Video: footage of the Titanic wreckage.]
Learning Outcomes and Rationale for Selection
I have carefully selected this as our first project for the following reasons:
The dataset is relatively small and easy to work with computationally (it comfortably fits into memory, and pretty much any basic model can be fit to the data in a few seconds or less).
The predictive task is well-defined and relatively uncomplicated, yet admits a multitude of rich approaches that allow you to explore many facets of data science.
You can get immediate feedback on the accuracy of your model by submitting your predictions to Kaggle.
There are a multitude of great tutorials, blog posts, and forum posts on this competition that you can leverage to learn both about this data and about data science more generally.
The learning outcomes that I am aiming for with this project are:
Students will have basic fluency with the toolset we will be using for the remainder of the course (e.g. Jupyter notebook, matplotlib, scikit-learn, pandas, etc.).
Students will understand the machine learning workflow, including issues such as parameter tuning, cross-validation, model selection, data visualization, and feature engineering.
Students will start the journey of learning how to learn data science. This includes things like being able to successfully read and understand blog posts, tutorials, and forum posts.
Students will develop the ability to communicate ideas as well as understand suggestions from their peers.
What This Assignment is Not About
I feel compelled to preemptively bring up a potential pitfall of using Kaggle competitions to introduce data science. I want to be crystal clear that the point of this assignment is to learn as much as possible; it is not to achieve the highest score on the Kaggle leaderboard. For instance, if you copied some code you didn't understand and used it to get a very high score on the leaderboard, I would consider that a very poor outcome. However, if you thoughtfully try various strategies, incorporate lots of interesting learning resources, and do a good job communicating with your peers about what worked and what didn't, I would consider that highly successful even if your leaderboard score was not that high.
Deliverables / Assignments
Getting started
First, make a fork of the DataScience16WarmupProject base repository. Once you have your repository set up, make an account at kaggle.com. Next, navigate to the Titanic competition and download both the training and testing data. You will want to commit these files to your repository.
Data Exploration
Create a Jupyter notebook that explores some basic properties of the data. Here are some suggestions:
Look at the survival rate based on different features (e.g. males versus females, young versus old, etc.)
Look at the survival rate based on conjunctions of features (e.g. young men versus older men, rich women versus poor women, etc.)
Visualize the relationship between survival and a continuous attribute using a scatter plot. If the plot is noisy, consider binning the attribute first.
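The groupby patterns suggested above can be sketched like this. In your real notebook you would load Kaggle's file with pd.read_csv("train.csv"); the tiny stand-in frame here (with the same column names as the Kaggle data) just illustrates the idioms:

```python
import pandas as pd

# Stand-in for train = pd.read_csv("train.csv"); same columns as the Kaggle data
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 2, 3, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 4.0, 54.0],
})

# Survival rate for a single feature
by_sex = train.groupby("Sex")["Survived"].mean()

# Survival rate for a conjunction of features
by_sex_class = train.groupby(["Sex", "Pclass"])["Survived"].mean()

# Bin a continuous attribute before computing rates (reduces scatter-plot noise)
train["AgeBin"] = pd.cut(train["Age"], bins=[0, 12, 18, 35, 60, 80])
by_age = train.groupby("AgeBin", observed=False)["Survived"].mean()
```

Each of these is a one-liner you can follow with a .plot(kind="bar") call to turn the table into a figure.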
Remember that Jupyter notebooks support literate programming. Make sure to include lots of motivation and interpretation in the form of Markdown cells both before and after your code. Some more advanced ideas for exploration, along with implementations of some of the basics, can be found in this Kaggle script. However, I wouldn't necessarily jump right to it, as you may miss out on some learning if you copy too much from the example.
Push your notebook to your Github repo as DataScience16WarmupProject/data_exploration.ipynb.
Model Iteration 1
Complete the DataQuest mission Getting Started with Kaggle (please note that the website forces you to use Python 3; here is a complete cheat sheet, but mainly be aware that print must be called like any other function).
Troubleshooting: I have found that DataQuest's mechanism for telling whether or not you got the correct answer is sometimes brittle. If you know too much, you may get yourself into trouble :). If you are confident that your solution is correct even though you are not being allowed to continue, it may be because the data type of your series is wrong. If you cannot understand why your answer keeps being marked wrong, consider changing the type of the series to "object" using .astype(object).
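The dtype conversion mentioned above is a one-liner; here is what it does to a pandas Series:

```python
import pandas as pd

# A numeric series whose float64 dtype might trip a brittle answer checker
s = pd.Series([0.0, 1.0, 1.0])

# Converting to the generic "object" dtype leaves the values alone
s_obj = s.astype(object)
print(s.dtype, s_obj.dtype)
```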
Next, you should create a Jupyter notebook called model_iteration_1.ipynb that repeats the steps from the DataQuest mission and generates the appropriate submission file. Make sure to include markdown that explains the basic steps that your code is performing. Upload the submission file to Kaggle. Add the accuracy to your Jupyter notebook.
Next, try to improve on the basic model from the DataQuest module. At the bottom of your model_iteration_1 notebook, create a new markdown cell. In the markdown cell, explain a revision you want to try to the model. This revision could be based on your intuition or on the results of your data exploration from the previous part. Once you have described the revision and the motivation for it, create a new code cell that implements your revised model, use its predictions to create a submission file, and upload it to Kaggle. Annotate your notebook with the new score and briefly describe what you learned about your model based on the score. You should try at least two revisions to the DataQuest model.
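The overall fit / predict / write-a-submission loop can be sketched as below. This is a minimal example, not the DataQuest solution: it uses a tiny stand-in frame (the real notebook would load train.csv and test.csv), and the choice of LogisticRegression and of the features list is just an illustration. Note the use of cross_val_score to estimate accuracy locally before spending a Kaggle submission:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny stand-in for pd.read_csv("train.csv"); same column names as the Kaggle data
train = pd.DataFrame({
    "PassengerId": range(1, 9),
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass": [3, 1, 2, 3, 3, 1, 2, 1],
    "Sex": ["male", "female", "female", "male", "female", "male", "male", "female"],
    "Age": [22, 38, 26, 35, 4, 54, 30, 28],
})

# Simple numeric encoding of Sex (0 = male, 1 = female)
train["Sex"] = (train["Sex"] == "female").astype(int)

features = ["Pclass", "Sex", "Age"]
model = LogisticRegression()

# Estimate accuracy locally before uploading anything to Kaggle
scores = cross_val_score(model, train[features], train["Survived"], cv=2)

# Fit on all of the training data, then write predictions in Kaggle's format.
# In practice you would call model.predict on the rows of test.csv instead.
model.fit(train[features], train["Survived"])
predictions = model.predict(train[features])
submission = pd.DataFrame({"PassengerId": train["PassengerId"],
                           "Survived": predictions})
submission.to_csv("submission.csv", index=False)
```

Kaggle expects exactly the two columns written here, PassengerId and Survived, with no index column, which is why index=False matters.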
When you are done, push your notebook to Github as DataScience16WarmupProject/model_iteration_1.ipynb.
Model Iteration 2
Now that you've gotten your feet wet, you are free to explore! The goal of this next part is to learn how to learn data science. It is not enough to simply try some new variation of your model; instead, I want you to revise your model based on the ideas of another data scientist. Any online resource is fair game, but here are some great places to start:
You can finish the DataQuest Kaggle module called improving your submission and use that for inspiration.
Here is a data scientist who achieved a very high score using a model written in Python.
Here is another tutorial that gets a very high score. Unfortunately it is written in R (another programming language commonly used for data science). You can probably extract the basic ideas and then translate them to Python.
The Kaggle forum for the Titanic competition is a great place to search for ideas.
Kaggle scripts for the Titanic competition are great resources.
Here is another page with more suggestions.
Note: you are expected to contribute to this list. Please e-mail me suggested resources and I'll add them to the list.
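To give a flavor of the kind of idea you might adopt from these resources: one feature-engineering trick that shows up in many Titanic tutorials (including DataQuest's improving-your-submission module) is extracting each passenger's title from the Name column. You are not required to use this idea; the sketch below just shows how such a borrowed idea might translate into a few lines of pandas:

```python
import pandas as pd

# Name values in Kaggle's Titanic data look like "Last, Title. First ..."
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the word between the comma and the following period
titles = names.str.extract(r",\s*([A-Za-z]+)\.", expand=False)
```

The extracted titles can then be numerically encoded and used as a model feature, much like Sex or Pclass.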
Create a notebook called model_iteration_2.ipynb that describes:
The source you are drawing inspiration from
Any additional ideas of your own that you are incorporating
Any relevant graphs or visualizations that support the revisions
The implementation of the model
The resultant score the model achieved when submitted to Kaggle
Your interpretation of the score (was the result what you expected? does it make sense?)
Ideas for future explorations
Push your code to Github as DataScience16WarmupProject/model_iteration_2.ipynb.
In-Class Sharing and Presentation
You should come to class ready to engage in a class-wide discussion. Prompts for discussion include:
The various models you tried
The various resources you found that helped you along the way
Tools that seem to be especially useful
Processes that you tried that seemed to work well for writing good data science code. Good code in this context could mean code that is concise, readable, maintainable, extensible, etc. You are not expected to have all the answers! Even if you tried something that seems suboptimal, put it out there, and see if someone has suggestions for improvement. Topics could include version control strategies, workflows, code refactoring, object-oriented design, etc.
Reflections on strategies for learning that you tried that didn't pan out
Any mysterious results that you can't yet explain.
Your burning questions (have at least two open-ended discussion questions you can use to get the others in the group talking).
We will spend about twenty-five minutes discussing this as a class.
Turning in Your Assignment
In order to turn in your assignment all you need to do is issue a pull request from your DataScience16WarmupProject repo to my upstream repo. For more information on how to issue a pull request, see this page. The pull request will be used as a way for the course staff to communicate feedback on your assignment.
Assessment
Your grade for this assignment will be based on your three notebooks. The notebooks (as a whole) will be graded on four components: functionality (40%), code documentation (15%), code style (15%), and writing quality (30%) according to this rubric.