Introduction
This class provides an undergraduate-level introduction to data science, a new and popular field. Data science develops principles, algorithms, tools, and best practices to manage data, focusing on three topics: (a) analyzing raw data to infer insights, (b) building data-intensive artifacts (such as recommender systems and knowledge bases), and (c) designing data-intensive experiments to answer questions (e.g., A/B testing). It combines techniques from the fields of databases, machine learning, human-computer interaction, visualization, statistics, optimization, and more.
The class is a more "introductory" version of the graduate-level class CS 784: Topics in Data Science (see the homepage of CS 784 in Spring 2016). Topics that I plan to cover (this is a tentative list):
The big picture: data science, Big Data, relational database systems, NoSQL systems; how they all fit together
Introduction to data science: the topics, the steps, stages, etc.
Data acquisition (from multiple sources, from internal and external)
Data extraction (i.e., extracting structured values out of unstructured data such as HTML pages and text documents)
Wrapper construction and information extraction from text
Data understanding, cleaning, and transforming
Data integration: string matching, data matching, schema matching and mapping, main data integration approaches
Data exploration and analysis
OLAP, classification, clustering, association rule mining, anomaly detection, data mining
Cross-cutting issues
Visualization, how the process is done, how the teams are structured
Scaling, monitoring, crowdsourcing
Building data-intensive artifacts: knowledge bases
Designing data-intensive experiments: A/B testing
Important data types: as data scientists, you need to know important data types and at least something about how to handle them
Relational data
Text, Web data
Data types that we will touch upon but won't cover: social media data, time series data, graph data
Target Audience
This class targets undergraduate students, both in CS and outside CS. Graduate students may find this class useful as well (especially graduate students in other departments, as they are not likely to be able to get into CS 784 due to the enrollment limit). In general, if you work with data at UW and want to know more about data science, or if you are just starting to explore working with data, this class can be appropriate.
Prerequisites
You must have CS 367 (Introduction to Data Structures) or an equivalent background. You should know Python, machine learning (in particular supervised learning), and relational database systems. If you have taken CS 540 and CS 564, you are fine.
If not, you must be willing to learn these three topics in the first 1-2 months of the class. In particular, I expect you to learn Python on your own (or consult a book or an online class). I can do a bootcamp for 1-2 weekends on machine learning and relational databases, covering just enough so that you can start in the class. You should be learning more about these topics as the class progresses.
Thus, the class is not designed to teach Python, machine learning, or relational databases (other than the bootcamps to give you a start).
What You Can Expect to Get Out of the Class
It is not possible to teach data science thoroughly in a single class. The goal of this class is to give you a high-level overview of the data science field. For example, we will not go deep into machine learning, statistics, or optimization. We will focus more on the problems, challenges, and current solutions. There will be a project that allows you to become familiar with some popular data science tools, such as those in the Python data ecosystem, but the course is not designed to teach you about these tools.
After taking the course, you will know enough to decide which direction to pursue in more depth, and where to look for materials on it.
Course Format and Workload
Class will meet twice a week for lectures, 75 minutes each. You are expected to read textbook chapters and papers as the lectures unfold, and you will need to study them (and the lecture slides) for the midterm and final.
There will be a midterm, a final, and a semester-long project. The project will ask you to do various data science tasks using tools in Python, and you will do it in a team. There may or may not be homework assignments (I haven't decided on this).
Workload-wise, it will probably be similar to that of other 500-level classes, such as CS 564. Of course, if you don't have a machine learning or relational database background, or if you don't know Python yet, then your workload will be somewhat higher.
Anticipated Frequently Asked Questions
Do I really need to have CS 564 and CS 540 before taking this class? As discussed above in "Prerequisites", no. If you haven't taken these classes, we will provide a bootcamp for 1-2 weekends to help you catch up. You would still need to do some learning/reading on your own though.
Can I use this toward the CS major? Yes, it is a 3-credit class and will count towards the undergraduate CS major (as an elective in the major).
Should I take this or CS 784? If you are an undergraduate (either in or out of CS), you should take this class. If you are a graduate student in the CS department, take CS 784. The exception is if you want to learn this material now and CS 784 is not offered, in which case you can consider taking this class. If you are a graduate student in another department, you can try to take CS 784, but you may not be able to get in due to the enrollment limit. In that case, this class will also be quite appropriate for you.
In short, CS 784 is designed more for CS graduate students. It will move at a faster pace, read more research papers, do deeper and more complicated projects, and have a looser structure. It may also cover certain cutting-edge, ill-defined topics. On the other hand, CS 638 is a "cleaner" version where things are (hopefully) more well-defined.