Big Data Analytics

Logistics

Instructors: Profs. Susan Davidson and Zachary Ives

Your fantastic TAs: Vian Djianto (Head TA), Carol Li (Head TA), Victor Castillo, Phillip Chau, Kunaal Chaudhari, Boom Devahastin Na Ayudhya, Calvin Hu, Michael Lau, Parmita Mishra, Ang Li, Ernest Ng, Prakruthi Raghav, Aditya Rathi, Paul Roh, Kavish Shah, Shreyans Tiwari, Warren Wang, Kailin Zheng

This class will be offered in hybrid format, with both on-campus and remote learning opportunities. Given Penn's new COVID-19 guidelines, the first two lectures will be delivered remotely, 12:00pm - 1:30pm, via Zoom. Thereafter:

  • Lectures will be Mondays and Wednesdays, 12:00pm - 1:30pm, in Towne 100 (Heilmeier Hall)
    Lectures will also be live-streamed via Zoom.
    Prerecorded video lectures will also be made available, linked through the syllabus.

  • Recitations will be Fridays, 1:45pm - 3:15pm, in Towne 100. Note the different time slot from the lecture! The Zoom link and materials are posted weekly on Piazza, and the presentation portions of recitations will be recorded. ALL recitation materials are self-contained within the Piazza post! Recitations are optional, but highly recommended.

Course Description

In the era of big data, we are increasingly faced with the challenge of converting massive amounts of data into actionable knowledge. Given the limits of individual machines (compute power, memory, bandwidth), the solution is more and more often to clean, integrate, and process the data using statistical machine learning techniques, in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.

Prerequisites

This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 110, MCIT 590, or the equivalent is required. Additional background in statistics, data analysis (e.g., in Matlab or R), and machine learning (e.g., CIS 519) is helpful.

Grading

The grade breakdown will be as follows:

  • homeworks (5-6 expected) 40%,

  • term project 20%,

  • midterm 15%,

  • final exam (2nd midterm) 15%,

  • engagement (attending lecture, watching videos, doing self-check quizzes) and participation (participating in person, posting to Piazza, asking Zoom questions) combine to make 10%.
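
As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale; the component names, the `weighted_score` helper, and the example scores are hypothetical and not part of the official grading process, which may also involve curving or rounding not described here.

```python
# Weights copied from the grade breakdown above (percent of the final grade).
# The dictionary keys are illustrative names, not official identifiers.
WEIGHTS = {
    "homeworks": 40,
    "term_project": 20,
    "midterm": 15,
    "final_exam": 15,
    "engagement_participation": 10,
}


def weighted_score(scores):
    """Combine per-component scores (assumed 0-100) into one 0-100 score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    # Each component contributes its score scaled by its percentage weight.
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student scores, for illustration only.
example_scores = {
    "homeworks": 90,
    "term_project": 85,
    "midterm": 80,
    "final_exam": 70,
    "engagement_participation": 100,
}
print(weighted_score(example_scores))  # 85.5
```

Note that the weights sum to 100, so a perfect score in every component yields 100.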

Late Days:

Students can submit homework assignments (including HW0, but not the term project) up to 48 hours late cumulatively, with no penalty. In other words, you have a total of 48 late hours to split across all homeworks.
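
To make the cumulative budget concrete, here is a small Python sketch. It assumes lateness is tracked in hours per homework and simply summed; the function name and the example numbers are hypothetical, and the policy above does not say how partial hours are rounded or what happens once the budget is exceeded, so the sketch does not model either.

```python
# Total late-hour budget stated in the policy above.
LATE_BUDGET_HOURS = 48


def late_hours_left(hours_late_per_homework):
    """Return the unused late hours given the lateness of each homework.

    A negative result means the budget has been exceeded; the consequence
    of that is not specified in the policy and is not modeled here.
    """
    return LATE_BUDGET_HOURS - sum(hours_late_per_homework)


# Hypothetical example: HW0 on time, HW1 10 hours late, HW2 20 hours late.
print(late_hours_left([0, 10, 20]))  # 18
```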

Collaboration Policy:

You are responsible for knowing Penn's Code of Academic Integrity. In particular, copying solutions from other students or other resources (e.g. the Web or from students who have taken the class in previous years) is NOT allowed. Making answers to homeworks or exams available to others either directly or by posting on the web is NOT allowed.

We will not have a sense of humor about violations of this policy!

Readings and Resources

  • Colab (for homeworks; you'll need a Google@SEAS or Gmail ID)

  • Piazza (questions, discussion)

  • Canvas (for access to the lecture recordings, which will be linked below)

  • Gradescope (for homework submission and exams; you'll be auto-added to this via Canvas)

  • OHQ (for less frustrating office hour queuing)

  • Please contact your staff in case of any trouble accessing these resources!

Readings:

We recommend several books for students of different skill levels.

For students who do not have at least 2 years of a CS degree: You should get the book Data Science from Scratch, 2nd ed, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed through the Penn libraries.

For all students: Python for Data Analysis, by McKinney, from O'Reilly. Again, an online version is accessible via the Penn libraries.

For advanced students: Python Machine Learning, 3rd edition by Raschka, from Packt. And indeed, an online version of this book is also accessible via the Penn libraries.

If you are new to Python and data science, you may find the free UC Berkeley book The Foundations of Data Science useful.

Office Hours

(Please see here to meet the staff!)

Office hours will include a mix of in-person and Zoom meetings. We will use OHQ to queue up.

Schedule

(Subject to revision)

CIS 545 Spring 2022 Schedule