Data Science @ UW

Note: this is an incomplete list. More will be added.

Courses in the CS Department

- For a general introduction to data science, data wrangling, and the big picture, take CS 638 DS (this class) or CS 838 DS (formerly CS 784; it will eventually become CS 774). CS 838 DS will be offered again in Spring 2017. Take only one of these courses, not both: they cover similar material at different depths.
- For RDBMSs/SQL, take CS 564, then CS 764. Paris Koutris also offers a CS 838 (which will eventually be offered as a regular 700-level class) on the theoretical aspects of data management, focusing on relational data. If you can only take one class, take CS 564.
- For machine learning, take CS 540, then CS 760, then CS 761. If you can only take one class, take CS 760. (If you are an undergraduate and can't get in, take CS 540; it teaches introductory ML.)
- For data visualization, Michael Gleicher offers a data visualization class. Go to his homepage and look for this course.
- For Big Data systems (that scale up data processing, e.g., Hadoop and Spark), there is a CS 838 class being offered by Aditya Akella. This will eventually be offered as a regular class (CS 744, I believe). It is being offered in Fall 2016.
- For optimization, take CS 524.
- For human-computer interaction, Bilge Mutlu has a set of courses.

What Should You Learn in Terms of Statistics

From a Ph.D. student in Statistics, MS in CS:

From talking to people and reading a few blog posts and data science interview questions, I have put together a list of things that data scientists need to know from statistics. For an undergraduate student, I would say keep it simple: know the basic concepts and tools extremely well, and know when to use what. Here's the list:

1. Basic probability concepts: random variables, probability distributions (binomial, geometric, Poisson, exponential, normal, Student's t, chi-square), Bayes' theorem.
2. Basic statistical inference frameworks: Law of Large Numbers, Central Limit Theorem, Student's theorem.
3. Basic statistical tests: t-test, paired t-test, non-parametric tests, chi-square test, Fisher's exact test... It is crucial to know when to use what. Classes: 309, 310.
4. Knowing experimental designs would be good: ANOVA, Kruskal-Wallis test, multivariate testing. Classes: 324, 424.
5. Linear Regression/Ordinary Least Squares: models, formulas, inference, assumptions, violations, diagnostics, and remedies. Classes: 333.
6. Generalized Linear Models: 333 covers some of that, I think, but it is crucial to learn logistic regression well. Once you have a good foundation in linear regression, GLMs would not be too bad to study by yourself.
7. Some sampling and survey frameworks. Classes: 421.
8. Knowing about time-series models would be great. I still wish that I had taken classes in these.
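
To make the "know when to use what" point about statistical tests concrete, here is a minimal sketch (not part of the original advice) that computes both a two-sample Welch t statistic and a paired t statistic on the same numbers. The function names and the before/after measurements are hypothetical, and only the Python standard library is used. When each subject is measured twice, the paired test works on per-subject differences and removes between-subject variation, so it can detect a shift that the independent-samples test misses.

```python
import math
import statistics


def welch_t(a, b):
    """Two-sample (Welch) t statistic, treating a and b as independent groups."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def paired_t(a, b):
    """Paired t statistic: a one-sample t-test on the per-subject differences."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


# Hypothetical measurements on the same five subjects, before and after a treatment.
before = [10.0, 12.0, 14.0, 16.0, 18.0]
after = [11.0, 13.5, 15.0, 17.5, 19.0]

t_independent = welch_t(after, before)  # ignores the pairing
t_paired = paired_t(after, before)      # uses the pairing

print(f"Welch t  = {t_independent:.3f}")
print(f"paired t = {t_paired:.3f}")
```

On this made-up data the paired statistic is far larger than the Welch statistic, because subjects differ a lot from one another while each subject's change is small and consistent. Turning either statistic into a p-value requires the t distribution with the appropriate degrees of freedom, which a statistics library would normally handle.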

Again, this is only my subjective view. Aside from statistics, for the data analysis phase of the data science pipeline, knowledge of machine learning methods and frameworks is crucial.

As for books to read:

1, 2, 3: Probability by Pitman, Statistics by Freedman (the most comprehensive statistics book ever). Skip to the important parts. In general, any good probability and statistics textbook will suffice.
4: Design and Analysis of Experiments by Montgomery, first few chapters.
5: Linear Models with R by Faraway.
6: Generalized Linear Models by McCullagh and Nelder (a true classic with lots of material), Extending the Linear Model with R by Faraway (easier to read, in my opinion).
7: Statistics by Freedman covers some sampling issues, I think. I wish I knew of a cheat sheet for this.
8: Does anyone know a good book on time series? I have been searching for one.

From a Ph.D. student in CS:

I don't have much experience in statistics, but my friends highly recommend STAT 609/610 to solidify your statistics background.

Some Related Courses on Campus (incomplete list)

LIS 677 - Concepts and Tools for Data Analysis and Visualization

Ours has become a data society. Like at no other time, our world--the natural world, from storm systems to diseases; governments and companies; and our conversations with friends and relatives, even our movements--is recorded in digital format. A few years ago, Google CEO Eric Schmidt famously stated that "There was 5 exabytes of information created between the dawn of civilization through 2003, but that much information is now created every 2 days, and the pace is increasing." (An exabyte is 1 trillion megabytes.) The result is that professional communicators in journalism and mass communication, as well as researchers and private firms, now literally have more data than they know what to do with. There is tremendous need across our society for people who are able to use data to investigate important questions, draw useful insights from those data, and communicate those insights to others--and also to be realistic and honest about what data can and cannot do. That is what this class is for: it is an introduction to the world of data, how data can be used to answer questions, and how those answers can be effectively and ethically communicated. More specifically, we will offer a combination of conceptual training, instruction in specific tools for data analysis and visualization, and the opportunity to put new skills to use in a final project. This course is intended for juniors, seniors, and graduate students; others may enroll by instructor permission. Research methods experience preferred. Prerequisite: Junior standing.
