Welcome to Data Science at Olin College, Spring 2019.  

Instructor: Allen Downey

Meetings: Monday and Thursday, 1:30 to 3:10pm, AC417.

Books:  Downey, Think Stats 2e; Downey, Think Bayes; VanderPlas, Python Data Science Handbook.
All three are available in free online versions as well as in paper copies.

Topics

Students in this class develop a toolkit for working with real datasets, performing exploratory data analysis, and generating effective visualizations.

We follow the order of topics presented in Think Stats, 2nd Edition:

1) Data management: extract, transform and load (ETL), data cleaning, and validation.

2) Exploration of a single variable: representing distributions, summary statistics, outliers and errors, robust statistics.

3) Exploration of variable relationships: scatter plots, correlation, linear regression, non-linear relationships.

4) Statistical inference, including hypothesis testing and estimation, using a simulation-based approach.

5) More than two dimensions: multivariate regression, logistic regression, visualization challenges.

6) Machine learning: feature extraction and selection, survey of methods, use of scikit-learn.

7) Python libraries for working with data, especially NumPy and Pandas.
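
To give a flavor of the simulation-based approach mentioned in topic 4, here is a minimal sketch of a permutation test for a difference in means. It is an illustration only, not code from the course or from Think Stats: the function name permutation_test, its parameters, and the sample numbers are all made up for this example.

```python
import numpy as np


def permutation_test(group_a, group_b, n_iter=1000, seed=0):
    """Estimate a two-sided p-value for a difference in means.

    Hypothetical helper for illustration: it shuffles the pooled data
    many times and counts how often a shuffled split produces a
    difference at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)

    # Test statistic: absolute difference in group means.
    observed = abs(a.mean() - b.mean())

    # Under the null hypothesis the group labels are interchangeable,
    # so we pool the data and reassign labels at random.
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if diff >= observed:
            count += 1

    # Fraction of simulated differences at least as extreme as observed.
    return count / n_iter


# Made-up measurements, used only to show the call.
sample_a = [7.1, 6.8, 7.4, 7.0, 6.9]
sample_b = [7.6, 7.9, 7.3, 8.0, 7.7]
p_value = permutation_test(sample_a, sample_b)
print(f"estimated p-value: {p_value:.3f}")
```

The appeal of this style is that the same loop works for other test statistics: swap the difference in means for a difference in medians or a correlation and the rest stays the same.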

Policies

Work in this course includes
  • Reading and other preparation.
  • In-class exercises.
  • Homeworks.
  • Quizzes.
  • Reports.

Homeworks
Homeworks include exercises from Think Stats, submitted in the form of Jupyter notebooks.  These are primarily intended to help you learn, so they are evaluated on a coarse scale.

Quizzes
We will have 6-8 in-class quizzes covering material from the readings and in-class exercises.    These are intended to evaluate whether you are keeping up with the class and understanding the content.  Make-up quizzes are generally not allowed because of practical limitations.  Instead, I will drop the lowest quiz score at the end of the semester, so you can miss one with impunity.

Reports
Over the course of the semester, you will write and publish three reports presenting your analyses.  I will suggest datasets you can use, but you are also free to find others.  You can use the same dataset for every report, or any number of different datasets.

You can work alone or in teams of two.  You can keep the same team for every report or mix it up.

The goal of each report is to use data to tell a story.  I will provide examples of the kind of analysis and presentation you should do.

We will publish your reports in the form of (1) a blog post that presents the primary results and (2) a Jupyter notebook with the details.  Some reports might appear as guest articles in my blog, Probably Overthinking It.

Grading
Your final grade will depend on a weighted sum of your scores for homeworks, quizzes, and reports.  I will deduct some credit for late homeworks and reports, and for failures of professionalism.


Participation and professionalism
As students, you share responsibility for creating and maintaining a classroom atmosphere that is conducive to everyone's learning and enjoyment.  I hope you will think about how your participation contributes to the learning environment.

Some things you can do to help:
0) Come to class.
1) Come to class on time!  I will do my best to use class time effectively.  Late arrivals are disruptive.
2) Come to class prepared.  Always make sure you have at least skimmed the reading.
3) Try not to fall behind.  If we are all working on the same stuff at the same time, everything works better.
4) Take care of your brain.  Eat well, sleep well, get some exercise.  Come to class ready to work.
5) Be professional.  If you have to miss a class, or need to submit work late, communicate with the instructor.
6) Be respectful toward the instructor and your fellow students.
7) Be generous with your ideas and your time.  Help each other.
8) Be reflective.  Think about what's working and what's not, and take responsibility for making the class work for you.
9) Be honest.
10) Have fun!