Big Data Analytics

Logistics

Instructor: Prof. Zachary Ives, 305 Levine Hall

Your fantastic TAs: Shubham Annadate, Vatsal Chanana, Pooja Consul, Andrew Cui, Craig Fan, Hoyt Gong, Isha Gupta, Anavi Kaushik, Hemanth Kothapalli, Younghu Park, Aishwarya Singh, Dewang Sultana, Raghav Vedire, Kunyang Zhang, Margaret Zheng, Marko Zotovic

Location: Heilmeier Hall, Towne 100

Due to COVID-19 precautions, we are moving the class fully online. Please watch Piazza for updates.

Synchronous lectures + Q&A: Mondays and Wednesdays 12:00pm - 1:30pm. Asynchronous lectures will be available.

Optional recitations: Fridays 1:30pm - 3:00pm, online.

Course Description

In the era of big data, we increasingly face the challenge of processing vast volumes of data. Given the limits of individual machines (compute power, memory, bandwidth), the common solution is to process the data in parallel across many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation scalably across many compute nodes; common approaches to converting algorithms to such programming models; standard data analysis toolkits that offer a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.

Prerequisites

This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 110, MCIT 590, or the equivalent is required. Additional background in statistics, data analysis (e.g., in MATLAB or R), and machine learning (e.g., CIS 519) is helpful.

Grading

The grade breakdown will be as follows: homeworks (5 expected): 35%; term project: 15%; midterm: 20%; final: 25%; participation: 5%.

Late Days:

Students can submit two homework assignments (including HW0 and the project) up to 48 hours late with no penalty. Details are pinned on Piazza.

Readings and Resources

Please sign up for the following Web resources:

Readings:

We recommend several books for students of different skill levels. The tentative list is:

For students who have completed fewer than two years of a CS degree: You should get the book Data Science from Scratch, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed from O'Reilly's Safari service.

For all students: Python for Data Analysis, by McKinney, from O'Reilly.

For advanced students: Python Machine Learning, 3rd edition, by Raschka, from Packt.

If you are new to Python and data science, you may find the free UC Berkeley book The Foundations of Data Science useful.

Office Hours

(Please see here to meet the staff!)

All office hours will be held in the Levine 5th-floor bump space near the elevators (not the inner part). The exception is Wednesday, when office hours will be held in Professor Ives's office, Levine 305.

Team Calendar

Schedule

(Subject to revision)
