Course Description

Introduction

This class provides a graduate-level introduction to data science, a relatively new and increasingly popular field. Data science develops principles, algorithms, tools, and best practices to manage data, focusing on three topics: (a) analyzing raw data to infer insights, (b) building data-intensive artifacts (such as recommender systems and knowledge bases), and (c) designing data-intensive experiments to answer questions (e.g., A/B testing). It combines techniques from the fields of databases, machine learning, human-computer interaction, visualization, statistics, optimization, and more.

This class is the same as "CS 784: Topics in Data Science," which I used to teach (see here for the homepage of CS 784 in Spring 2016). I plan to cover the following topics (this list is tentative):

  • The big picture: data science, Big Data, relational database systems, NoSQL systems; how they all fit together

  • Introduction to data science: the topics, the steps, stages, etc.

  • Data acquisition (from multiple sources, both internal and external)

  • Data extraction (i.e., extracting structured values out of unstructured or semi-structured data such as HTML pages and text documents)

    • Wrapper construction and information extraction from text

  • Data understanding, cleaning, and transforming

  • Data integration: string matching, data matching, schema matching and mapping, main data integration approaches

  • Data exploration and analysis

    • OLAP, classification, clustering, association rule mining, anomaly detection, data mining

  • Cross-cutting issues

    • Visualization, how the process is done, how the teams are structured

    • Scaling, monitoring, crowdsourcing

  • Building data-intensive artifacts: knowledge bases

  • Designing data-intensive experiments: A/B testing

  • Important data types: as data scientists, you need to know important data types and at least something about how to handle them

    • Relational data

    • Text, Web data

    • Data types that we will touch upon but won't cover: social media data, time series data, graph data

Target Audience

This class targets graduate students, both in CS and outside CS. In general, if you work with data at UW and want to know more about data science, or if you are just exploring working with data, this class may be appropriate for you.

Prerequisites

You must have CS 367 (Introduction to Data Structures) or an equivalent background. You should know Python, machine learning (in particular supervised learning), and relational database systems. If you have taken CS 540 and CS 564, you are fine.

If not, you must be willing to learn these three topics within the first month of the class. In particular, I expect you to learn Python on your own (for example, from a book or an online class). You should keep learning more about these topics as the class progresses.

The class is not designed to teach Python, machine learning, or relational databases.

What You Can Expect to Get Out of the Class

It is not possible to teach data science thoroughly in a single class. The goal of this class is to give you a high-level overview of the data science field. We will not go deep into machine learning, statistics, or optimization, for example. We will focus more on the problems, challenges, and current solutions. There will be a project that allows you to become familiar with some popular data science tools, such as those in the Python data ecosystem, but the course is not designed to teach you about these tools.

After taking the course, you will know enough to decide in which direction to go deeper and where to look for materials.

Course Format and Workload

The class will meet twice a week for 75-minute lectures. You are expected to read textbook chapters and papers as the lectures unfold, and you will need to study them (along with the lecture slides) for the midterm and final.

There will be a midterm, a final, and a semester-long project. The project will ask you to carry out various data science tasks using tools in Python, and you will do it in a team.

The workload will probably be similar to that of other 700-level classes, such as CS 764. Of course, if you do not have a background in machine learning or relational databases, or if you do not yet know Python, your workload will be somewhat higher.

Anticipated Frequently Asked Questions

  • Do I really need to have CS 564 and CS 540 before taking this class? As discussed above in "Prerequisites", no. If you haven't taken these classes, however, you will need to do some learning and reading on your own.

  • Can I count this course for MS core and/or PhD breadth? Yes, it is a 3-credit class and will count towards those goals.