Lecture 01

For today you should:

1) Read Chapter 1 of Think Stats 2e on NB.

2) Install Anaconda.

3) Fork and clone the class repo (see below).

Today:

1) Class overview

2) The Second Data Revolution

3) Project planning

4) Chapter 1 exercises

For next time you should:

1) Read Chapter 2 of Think Stats 2e on NB.

2) Watch Jake Porway at TEDx

Optional: watch this discussion on Fox

Consider applying for The Eric & Wendy Schmidt Data Science for Social Good Summer Fellowship at the University of Chicago.

Data Science

An interdisciplinary toolkit designed for the use of data to answer questions and guide decisions.

1) Statistics / machine learning

2) Software engineering

3) Databases / data processing architectures

4) Visualization

5) Domain knowledge

Many projects require knowledge/skill in all of these areas.

Data scientists should have some expertise in all, but even so, most projects require teams.

This toolkit is

1) Versatile: applicable to science, engineering, business, politics, and on and on.

2) In demand: increasing demand for data-driven everything.

3) Transformative: as I'll explain in a minute, we are in the middle of a revolution that will fundamentally change people's expectations about the nature and impact of data.

A few more links about data science:

1) Here's what Wikipedia thinks

2) Drew Conway's data science Venn diagram

3) Here's a relevant discussion on reddit/r/statistics, and my contribution:

AllenDowney 26 points 1 year ago 

Having read "The Theory That Would Not Die" and "The Lady Tasting Tea" recently, I suggest the following conjecture:

The term "data scientist" has been created to describe what people want from a statistician, but which many statisticians fail to provide because statistics, as a field, spent too much time in the 20th century on problems in philosophy of science, and theoretical mathematical problems, and not enough time on practical applications and the use of computation to analyze data. As a result, many graduates from statistics programs have a set of skills that is not a good match for what the market wants them to do. This market vacuum is being filled by "data science."

That's my theory, which is mine.

The second data revolution

The second data science revolution

Software

1) I recommend that you work in Linux.  If you do, it is more likely I can help you if you run into trouble.  If you choose not to, you are swimming at your own risk (although there are likely to be other people who can help you).

2) I recommend using Python 2, although all code for Think Stats 2 should work with Python 3.

3) You will need NumPy, SciPy and some other packages.  If you are not already using Anaconda, which is a Python distribution that provides these packages and many more, I strongly recommend it.

4) The GitHub repository for this class is  https://github.com/AllenDowney/ThinkStats2.  You should fork it on GitHub and then clone it to your hard drive.  If you are not familiar with Git and GitHub, you should work through my Git tutorial.
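Once Anaconda (or another distribution) is installed, a quick way to confirm that the packages are visible to your interpreter is a check like the one below. This is only a sketch written for Python 3: the helper name `missing_packages` is made up for illustration, and pandas stands in for the "other packages", which the list above does not name.

```python
import importlib.util
import sys

# NumPy and SciPy are named in the notes; pandas is an assumed example
# of the "other packages".
REQUIRED = ["numpy", "scipy", "pandas"]


def missing_packages(names):
    """Return the names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version:", sys.version.split()[0])
missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages found.")
```

If anything shows up as missing, installing it through your distribution's package manager is usually the simplest fix.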

Project selection

Over the last two months I have recruited 15 external collaborators with projects for you to work on.  Many of the projects are related to this semester's target subject, Health and Medicine, but others are on unrelated topics.  I hope you all find something you are excited about.

1) Please read this page, which is how I explained the projects to collaborators.

2) Read the project survey (coming soon) so you know what I am going to ask, but do not fill it out yet.

3) Read the project descriptions.  They are long, and there are a lot of them, but please take the time to read them carefully so you can decide which ones you want to work on.

You will have time in class today to discuss projects with your classmates, and think about possible teams.  If you have questions, we will have a chance to answer them on Friday.

If you have a teammate you want to work with, the two of you should fill out one copy of the survey, indicating your joint interests.

Chapter 1

"The thesis of this book is that data combined with practical methods can answer questions and guide decisions under uncertainty."

Reading questions:

Study design: what's the difference between:

1) Observational vs. experimental

2) Cross-sectional vs. longitudinal

3) Representative vs. oversampled

4) Raw data vs. recodes

5) What's an example of something you might do during data cleaning?

6) What's an example of something you might do during data validation?
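For questions 5 and 6, here is one possible answer, sketched with pandas on made-up data. The column name, the sentinel codes (97, 98, 99), and the expected counts are all assumptions chosen for illustration. In practice the codes and the expected counts would come from the codebook for the dataset you are using.

```python
import numpy as np
import pandas as pd

# Made-up toy data, not real survey records.
df = pd.DataFrame({"birthwgt_lb": [7, 6, 8, 99, 7, 98, 5]})

# Cleaning: replace sentinel codes with NaN so they do not distort
# summary statistics such as the mean.
SENTINELS = [97, 98, 99]
cleaned = df["birthwgt_lb"].replace(SENTINELS, np.nan)

# Validation: compare the observed value counts with the counts you
# expect (here a hypothetical stand-in for a published codebook table).
expected_counts = {5.0: 1, 6.0: 1, 7.0: 2, 8.0: 1}
actual_counts = cleaned.value_counts().sort_index().to_dict()

print(cleaned.mean())
print(actual_counts == expected_counts)
```

Without the cleaning step, the codes 98 and 99 would be treated as weights in pounds and pull the mean far upward.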

Ethical considerations: 

1) Depending on where the data are from, there might be legal and/or ethical requirements for what you do with them, and what you can do with your results.

2) When you are working with data, it is easy to lose sight of the context, so easy that being "treated like a number" is a metaphor for a certain kind of unethical behavior.

3) On the other hand, when you are working with data that has emotional weight, it is often necessary to maintain some professional detachment.

Finding the right balance between (2) and (3) is an important element of ethical data science.

DataFrame

1) Kinda like an array.

2) Kinda like a map from column name to Series.
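Both views of a DataFrame can be seen in a small sketch. The data and column names are made up for illustration.

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({"age": [25, 31, 42], "weight": [7.1, 6.5, 8.0]})

# Like an array: it has a shape and supports positional indexing.
print(df.shape)        # (3, 2)
print(df.iloc[1, 0])   # 31

# Like a map from column name to Series: index it with a column name.
ages = df["age"]
print(type(ages).__name__)   # Series
print(list(df.columns))      # ['age', 'weight']
```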

Series

1) Kinda like a list.

2) Kinda like an ordered map from label to value.
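The same two views apply to a Series. The values and labels below are made up for illustration.

```python
import pandas as pd

s = pd.Series([3.5, 1.2, 9.9], index=["a", "b", "c"])

# Like a list: it has a length, an order, and positional access.
print(len(s))        # 3
print(s.iloc[0])     # 3.5
print(s.tolist())    # [3.5, 1.2, 9.9]

# Like an ordered map from label to value.
print(s["b"])            # 1.2
print(list(s.index))     # ['a', 'b', 'c']
print("c" in s)          # True (membership tests labels, not values)
```

Note that `in` checks the labels, as it would check the keys of a dictionary, which is one place where the "list" analogy breaks down.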

Exercise: Do exercise 1.1 in Think Stats by filling in the blanks in chap01ex.ipynb.