Choose Your Own Adventure

Abstract

Now that you've taken your first step in your journey to becoming a data scientist, we're going to do a second Kaggle iteration to expand your skills further.  The high-level ways in which we will be expanding our focus from the last project are: the dataset will be more complex (containing either non-numerical data or spatially and temporally distributed data), the project will be longer (4 class sessions instead of 3), and you will be working with a partner (instead of by yourself).

Learning Outcomes and Rationale for Selection

A major focus of this course is iteration.  By taking a second spin through Kaggle, you will get another chance to explore machine learning within its friendly learning environment.  Through this new opportunity, you will begin to develop generalizable lessons about both the tools and the process for successfully doing machine learning.

Another dimension of this project is that we are gradually increasing the amount of autonomy available to students.  In this project, you will have both more choice (in dataset and in approach) and more responsibility for taking charge of your learning experience.

The final dimension that this project explores is teamwork and communication.  This project emphasizes both working directly with one of your peers and engaging in collaborative brainstorming with others working on the same dataset.

Datasets

You must choose one of these datasets for your project.  The reason I am not allowing people to branch out beyond these is that I want to make sure that there is a critical mass of teams working on each.  This overlap is necessary for the collaborative brainstorming that we will be doing throughout this project.

SF Crime Data

Here is the relevant text from the Kaggle competition page.

Predict the category of crimes that occurred in the city by the bay

From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz.

Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay.

From Sunset to SOMA, and Marina to Excelsior, this competition's dataset provides nearly 12 years of crime reports from across all of San Francisco's neighborhoods. Given time and location, you must predict the category of crime that occurred.
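
To make the task concrete, below is a minimal baseline sketch for this prediction problem.  It assumes the column layout of the competition's train.csv (Dates, Category, PdDistrict, X, Y) and the competition's multi-class log loss metric; verify both against your download.

```python
# A minimal time-and-location baseline sketch for the SF crime data.
# Column names are assumed from the competition's train.csv.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv", parse_dates=["Dates"])

# Turn time and location into simple numeric features.
features = pd.DataFrame({
    "hour": train["Dates"].dt.hour,
    "x": train["X"],
    "y": train["Y"],
})
features = features.join(pd.get_dummies(train["PdDistrict"], prefix="district"))

# The competition scores on multi-class log loss, so evaluate with that.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, features, train["Category"],
                         scoring="neg_log_loss", cv=3)
print("cross-validated log loss:", -scores.mean())
```

Anything that beats this kind of baseline is evidence that your added features or model complexity are earning their keep.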

Here are some of the interesting features of this dataset:

Bike Share

Here is the relevant text from the Kaggle competition page.

Forecast use of a city bikeshare system

Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city.  Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis.  Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded.  Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city.  In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
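
As a concrete starting point, here is a hedged sketch of a first regression pass.  It assumes the competition's train.csv columns (datetime, workingday, temp, humidity, count); the competition scores on root mean squared logarithmic error (RMSLE), approximated below on a held-out split.

```python
# A minimal regression sketch for the bike sharing data.  Column names
# are assumed from the competition's train.csv.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv", parse_dates=["datetime"])

# Hour of day plus a few weather-related columns as a first feature set.
X = pd.DataFrame({
    "hour": train["datetime"].dt.hour,
    "workingday": train["workingday"],
    "temp": train["temp"],
    "humidity": train["humidity"],
})
y = train["count"]

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Root mean squared logarithmic error, the competition's metric.
pred = model.predict(X_val)
rmsle = np.sqrt(np.mean((np.log1p(pred) - np.log1p(y_val)) ** 2))
print("validation RMSLE:", rmsle)
```

Note that a random split leaks temporal structure (the competition's test set is held out by date), so a time-aware validation split would be a natural next iteration.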

Here are some of the interesting features of this dataset:

Sentiment Analysis on Movie Reviews

Here is the relevant text from the Kaggle competition page.

Classify the sentiment of sentences from the Rotten Tomatoes dataset

"There's a thin line between likably old-fashioned and fuddy-duddy, and The Count of Monte Cristo ... never quite settles on either side."

The Rotten Tomatoes movie review dataset is a corpus of movie reviews used for sentiment analysis, originally collected by Pang and Lee [1]. In their work on sentiment treebanks, Socher et al. [2] used Amazon's Mechanical Turk to create fine-grained labels for all parsed phrases in the corpus. This competition presents a chance to benchmark your sentiment-analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging.
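
A reasonable first baseline treats each phrase as a bag of words.  Below is a minimal sketch along those lines; it assumes the competition's train.tsv layout (a Phrase column and a Sentiment label from 0 to 4).

```python
# A minimal bag-of-words sketch for the movie review phrases.  The file
# layout is assumed from the competition's train.tsv.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

train = pd.read_csv("train.tsv", sep="\t")

# Unigrams and bigrams feeding a linear classifier: bigrams give the
# model at least a fighting chance against negation ("not good" vs. "good").
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(model, train["Phrase"], train["Sentiment"], cv=3)
print("cross-validated accuracy:", scores.mean())
```

Sarcasm and longer-range sentence structure will defeat a model like this, which is exactly what makes the dataset interesting.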

Here are some interesting features of the dataset:

Deliverables / Assignments

Getting started

First, make a fork of the DataScience16CYOA base repository.  Once you have your repository set up, choose one of the datasets, download the data, and check it into your repository (if it is not too large).

Project Proposal and Learning Goals

On Friday February 5th, you should come to class with a document (printed out as well as checked into your repository) that describes:

Data Exploration

Create a Jupyter notebook that explores your dataset.  You should focus on visualization and description rather than training a predictive model.  While you will certainly iterate on this process, you must have some version of this done before our in-class discussion / brainstorming session on Tuesday February 9th.
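
If you are not sure where to start, a first exploration cell might look something like the sketch below (shown with the SF crime data's Category column as a stand-in target; swap in whatever your dataset uses).

```python
# A hypothetical first exploration cell.  "Category" is the SF crime
# target column; substitute your own dataset's columns.
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

print(train.shape)            # how much data are we working with?
print(train.dtypes)           # which columns are numeric vs. categorical?
print(train.isnull().sum())   # any missing values to worry about?

# How is the target distributed?  A bar chart often beats a table for a
# first look at class balance.
train["Category"].value_counts().plot(kind="barh", figsize=(8, 10))
plt.xlabel("number of incidents")
plt.tight_layout()
plt.show()
```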

Model Iteration

You should have a Jupyter notebook (or a set of notebooks) that walks through the various models you tried for the competition.  For each model, include your rationale for creating the model, the code, the results, and your interpretation of those results.  You must have some version of this done before our in-class discussion on Tuesday February 16th.
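
One hedged pattern for keeping this organized: evaluate every candidate model the same way, so the numbers in your notebook are directly comparable.  In the sketch below, make_classification is a synthetic stand-in for whatever feature matrix your earlier cells produce.

```python
# A sketch of a uniform model-comparison loop.  make_classification is a
# stand-in for your engineered features; in your notebook, X and y would
# come from earlier cells.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

candidates = {
    "logistic regression (baseline)": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every candidate with the same cross-validation setup.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=3)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

Recording each model's score next to your written rationale also makes the final "Overall Story" notebook much easier to assemble.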

Overall Story

You should write up a narrative of your work on this project.  You should include your explorations of the data, the models you tried, and your interpretation of those results.  The story here is distinct from the exact chronology of your work on the project.  You are free to take a bit of creative license with what you include and how you present your work in order to highlight the important takeaways.

This writeup should take the form of a notebook, and the document should be written for an external audience.  The external audience in this case includes potential employers, folks on Kaggle, other data scientists, etc.  This assignment is due on Tuesday February 23rd.

Insight Extraction and Generalization

On the day the project is due, you will be going over your final results with other folks working on the same dataset.  Your conversation should focus on extracting generalizable lessons about what worked well and what didn't for this specific dataset.  Here the term "worked well" could be defined in terms of accuracy, quality of workflow, quality and ease of collaboration, etc.

You will be presenting the results of this discussion to the class as a whole.  The aim will be for each group (where a group is defined as all students working on a particular dataset) to spend 15 minutes presenting its work to the rest of the class.

Turning in Your Assignment

In order to turn in your assignment, all you need to do is issue a pull request from your DataScience16CYOA repo to my upstream repo.  For more information on how to issue a pull request, see this page.  The pull request will be used as a way for the course staff to communicate feedback on your assignment.

Assessment

Your grade for this assignment will be based on your notebooks.  Primarily I will be assessing your "Overall Story" notebook, but I may look at other notebooks to get a better sense of the totality of your work on the project.  Your grade will be based on four components: functionality (40%), code documentation (15%), code style (15%), and writing quality (30%) according to this rubric.