The required book for this class is “Confident Data Skills” by Kirill Eremenko. The book is meant for a wide audience, is very recently published, and is inexpensive. We will only be reading chapters 1-5, after which the book goes into techniques that will be covered in more detail in your other classes.
Many of the readings for this course are taken from popular publications that do a good job of communicating arguments using data.
"Confident Data Skills" Chapter 1: Defining Data
and
French election results: Macron’s victory in charts
"Confident Data Skills" Chapter 2: "How data fulfills our needs" and Chapter 3: "The data science mindset"
and
I also love this collection of "falsehoods programmers believe in". It isn't assigned reading, but I recommend you poke around. A good example is "Falsehoods programmers believe about time", which starts with "There are always 24 hours in a day."
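If you want to see why "There are always 24 hours in a day" is a falsehood, here is a minimal Python sketch. It assumes one specific case chosen for illustration: New York on 2021-03-14, the day daylight saving time began in the US. The UTC offsets are hard-coded instead of being looked up in a time zone database, and the helper name `hours_between` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption for illustration: New York, where clocks jumped forward
# on 2021-03-14. Offsets are hard-coded, not read from a tz database.
EST = timezone(timedelta(hours=-5), "EST")  # before the change
EDT = timezone(timedelta(hours=-4), "EDT")  # after the change


def hours_between(start, end):
    """Elapsed hours between two timezone-aware datetimes."""
    return (end - start) / timedelta(hours=1)


# Local midnight to the next local midnight on the day the clocks changed.
dst_day_hours = hours_between(
    datetime(2021, 3, 14, 0, 0, tzinfo=EST),
    datetime(2021, 3, 15, 0, 0, tzinfo=EDT),
)

# The same measurement on an ordinary day one week earlier.
normal_day_hours = hours_between(
    datetime(2021, 3, 7, 0, 0, tzinfo=EST),
    datetime(2021, 3, 8, 0, 0, tzinfo=EST),
)

print(dst_day_hours, normal_day_hours)
```

The day the clocks change is one hour short of an ordinary day, so any analysis that assumes 24 rows of hourly data per day would quietly go wrong on that date.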
"Confident Data Skills" Chapter 4: "Identify the question" and Chapter 5: "Data Preparation"
and
We’re Measuring the Economy All Wrong
The following articles explore a popular method of processing English text (the word2vec algorithm) and how it encodes the societal biases of its training data. They are pretty tough reads, so you may need to google some terms and take some time to work through them.
And a fun article:
and
Murder rates don't tell us everything about gun violence
How does Spotify know you so well?
and
YouTube, the Great Radicalizer
1. Scientific Racism's new face:
https://medium.com/@blaisea/physiognomys-new-clothes-f2d4b59fdd6a
2. An exploration of predicting someone's gender from their face.
http://gendershades.org/index.html
3. Don't trust your data: how to spot photoshopped images. This is a long read, and I don't expect you to get through all of it. But we are using techniques from it in class, and I think it is super interesting.
http://blackhat.com/presentations/bh-dc-08/Krawetz/Whitepaper/bh-dc-08-krawetz-WP.pdf
Ethics and Data Science by Mike Loukides, Hilary Mason, DJ Patil. Two ways to get this:
This reading is pretty long. It is ok to skip around and pick some interesting parts, but I think it is really worth the time.
Scientists rise up against statistical significance
and
We Experiment on Human Beings!
bonus (optional, hard, deep)
Why most published research findings are false
The media has a probability problem
Look, I know you are more worried about finals than doing the reading. So why don't you save this one for over break? This, to me, gets at the core of data science: