Big Data Analytics


Instructor: Prof. Zachary Ives

Your fantastic TAs: Suyog Bobhate, Kunaal Chaudhari, Jason Chen, Jonathan Choi, Andrew Clark, Andrew Cui, Ria Gandhi, Hoyt Gong, Karishma Jain, Akhi Khakhar, Carol Li, Younghu Park, Rohil Sheth, Arielle Stern, Arth Talati, Marko Zotovic

Due to the size and scale of the class, this class will be offered via remote learning. But please don't let that be a barrier to getting to know us!

  • Lectures will be prerecorded and made available asynchronously through the schedule below.

  • Synchronous activities and Q&A will be on Zoom, Wednesdays, 1:30pm - 2:30pm (the link will be in Piazza).

  • Recitations will be on Fridays, 1:30 - 2:30 via Zoom (the link will be in Piazza, provided on a weekly basis). The presentation portions of recitations will be recorded. Recitations are optional, depending on the topic, but recommended.

  • Office hours (link on Piazza) will be in 305 Levine, Mondays 2:00-3:00 and Wednesdays 9:00-10:00. Additional time can be booked via the link above!

Course Description

In the new era of big data, we face the growing challenge of processing vast volumes of data. Given the limits of individual machines (compute power, memory, bandwidth), the solution is increasingly to process the data in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.


This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 110, MCIT 590, or the equivalent is required. Additional background in statistics, data analysis (e.g., in Matlab or R), and machine learning (e.g., CIS 519) is helpful.


The grade breakdown will be as follows:

  • homeworks (5 expected) 40%,

  • term project 20%,

  • midterm 15%,

  • final exam (2nd midterm) 15%,

  • engagement (watching videos, doing self-check quizzes) and participation (posting to Piazza, asking Zoom questions) combine to make 10%.
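As a rough illustration of how the weights above combine, here is a minimal Python sketch. The weights come from the breakdown; the function name, the dictionary keys, the 0-100 score scale, and the example scores are all hypothetical, and this is not an official grading formula.

```python
# Weights taken from the grade breakdown above, as fractions of the final grade.
WEIGHTS = {
    "homeworks": 0.40,
    "term_project": 0.20,
    "midterm": 0.15,
    "final_exam": 0.15,
    "engagement_participation": 0.10,
}


def weighted_grade(scores):
    """Combine per-component scores (assumed to be on a 0-100 scale).

    `scores` is a hypothetical dict keyed like WEIGHTS.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# The five weights should account for the whole grade.
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

# Made-up scores, for illustration only.
example_scores = {
    "homeworks": 90,
    "term_project": 85,
    "midterm": 80,
    "final_exam": 75,
    "engagement_participation": 100,
}
print(round(weighted_grade(example_scores), 2))
```

With these made-up scores, the homeworks contribute 36 points, the term project 17, the midterm 12, the final exam 11.25, and engagement and participation 10.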

Late Days:

Students can submit homework assignments (including HW0, but not the term project) up to 48 hours late cumulatively, with no penalty. In other words, you have 48 late hours to split across all homeworks.
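One way to keep track of the shared 48-hour budget is sketched below. The function name and input format are hypothetical. The policy above does not say how partial hours are counted or what happens once the budget is exhausted, so summing fractional hours and raising an error are assumptions of this sketch, not course rules.

```python
LATE_HOUR_BUDGET = 48.0  # total late hours across all homeworks, per the policy above


def remaining_late_hours(hours_late_per_homework):
    """Return the unused late hours after the given submissions.

    `hours_late_per_homework` is a hypothetical list with one entry per
    submitted homework: the number of hours it was late (0 if on time).
    Raising an error when the budget is exceeded is an assumption.
    """
    # Early submissions (negative values) do not earn back late hours.
    used = sum(max(0.0, hours) for hours in hours_late_per_homework)
    if used > LATE_HOUR_BUDGET:
        raise ValueError(
            f"{used} late hours used; the budget is {LATE_HOUR_BUDGET}"
        )
    return LATE_HOUR_BUDGET - used


# Made-up example: HW0 on time, HW1 12 hours late, HW2 6.5 hours late.
print(remaining_late_hours([0, 12, 6.5]))
```

In this made-up example, 18.5 of the 48 hours are used, leaving 29.5 for the remaining homeworks.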

Readings and Resources

  • Colab (for homeworks; you'll need a Google@SEAS or GMail ID)

  • Piazza (for questions)

  • Canvas (for access to the lecture recordings, which will be linked below)

  • Gradescope (for homework submission and exams; you'll be auto-added to this via Canvas)

  • (for breakout groups and study groups)


We recommend several books for students of different skill levels.

For students who do not have at least 2 years of a CS degree: You should get the book Data Science from Scratch, 2nd ed, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed through the Penn libraries.

For all students: Python for Data Analysis, by McKinney, from O'Reilly. Again, an online version is accessible via the Penn libraries.

For advanced students: Python Machine Learning, 3rd edition by Raschka, from Packt. And indeed, an online version of this book is also accessible via the Penn libraries.

If you are new to Python and data science, you may find the free UC Berkeley book The Foundations of Data Science useful.

Office Hours

(Please see here to meet the staff!)

Office hours will be held via OHQ and Zoom, and we'll be trialing an additional tool. More details will be provided soon!


(Subject to revision)

CIS 545 Fall 2020 Schedule