Big Data Analytics

Logistics

Instructors: Profs. Jacob Gardner and Zachary Ives

Your fantastic TAs: Boom Devahastin Na Ayudhya, Ernest Ng, Phillip Chau, Vatsal Jain, Karan Jaisingh, Arjun Kanthawar, Bhairavi Muralidharan, Sara Nayak, Thiti (Nob) Premrudeepreechacharn, Aishwarya Ramanath, HyungSeok (Paul) Roh, Karan Sampath, Kavish Shah, Akanksha Tripathy, Shreyans Tiwari, Stephanie Walsh

This class will be offered in hybrid format, with both on-campus and remote learning opportunities.

  • Lectures will be Mondays and Wednesdays, 1:45pm - 3:15pm, Fagin Auditorium
    Lectures will also be live-streamed via Zoom. However, this is an in-person class and remote participation is intended to be used only for excused absences.
    Prerecorded video lectures will also be made available for a limited window, linked through the syllabus.

  • Recitations will be Fridays, 1:45pm - 3:15pm (same location). Recitations are optional, but highly recommended.

Course Description

In the era of big data, we increasingly face the challenge of converting massive amounts of data into actionable knowledge. Given the limits of individual machines (compute power, memory, bandwidth), the solution is often to clean, integrate, and process the data using statistical machine learning techniques, in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.

Prerequisites

This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 110, MCIT 590, or the equivalent is required. Additional background in statistics or data analysis (e.g., in Matlab or R) is helpful but not required.

Grading

The grade breakdown will be as follows:

  • homeworks (5-6 expected): 40%

  • term project: 20%

  • midterm: 15%

  • final exam (2nd midterm): 15%

  • quizzes (due 1 week after the lecture!): 7%

  • participation (participating in person and posting to Ed, combined): 3%
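
The breakdown above can be read as a weighted sum. The following is a minimal sketch for estimating an overall score, assuming each component is scored on a 0-100 scale. The function name, dictionary keys, and example scores are hypothetical and are not part of any official grading tool, and the course staff may compute final grades differently (for example, with curving).

```python
# Weights copied from the grade breakdown above (percent of the final grade).
WEIGHTS = {
    "homeworks": 40,
    "term_project": 20,
    "midterm": 15,
    "final_exam": 15,
    "quizzes": 7,
    "participation": 3,
}


def weighted_total(scores):
    """Combine per-component scores (each 0-100) into one 0-100 total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "homeworks": 90,
    "term_project": 85,
    "midterm": 80,
    "final_exam": 75,
    "quizzes": 100,
    "participation": 100,
}

print(weighted_total(example_scores))
```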

Masking + COVID Policy:

This course will be held in person, and the expectation is that all students will participate in the classroom. Live and recorded video will be available for a limited window for students who are unable to join in person. For the safety of several students who have health concerns, we are requiring that all students wear a mask in the classroom and during in-person office hours.

Late Days:

Students can submit homework assignments (including HW0, but not the term project) up to 48 hours late cumulatively, with no penalty. In other words, you have a total of 48 late hours to split across all homeworks.

Collaboration Policy:

You are responsible for knowing Penn's Code of Academic Integrity. In particular, copying solutions from other students or other resources (e.g., the Web or students who have taken the class in previous years) is NOT allowed. While you can share high-level ideas and discuss concepts, you are NOT allowed to share code with each other. Needless to say, making answers to homework assignments or exams available to others, either directly or by posting on the web, is NOT allowed.

We will not have a sense of humor about violations of this policy!

Readings and Resources

  • Colab (for homeworks; you'll need a Google@SEAS or Gmail ID)

  • Ed Discussion (questions, discussion)

  • Canvas (for access to the lecture recordings, which will be linked below)

  • Gradescope (for homework submission and exams; you'll be auto-added to this via Canvas)

  • OHQ (for less frustrating office hour queuing)

Readings:

We recommend several books for students of different skill levels.

For students who do not have at least 2 years of a CS degree: You should get the book Data Science from Scratch, 2nd ed, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed through the Penn libraries. You may also find the UC Berkeley free book The Foundations of Data Science useful.

For all students: Python for Data Analysis, by McKinney, from O'Reilly. Again, an online version is accessible via the Penn libraries.

For advanced students: Python Machine Learning, 3rd edition by Raschka, from Packt. And indeed, an online version of this book is also accessible via the Penn libraries.


Office Hours

(Please see here to meet the staff!)

Office hours will include a mix of in-person and Zoom meetings. We will use OHQ to queue up.

Schedule

(Subject to revision)

CIS 5450 Fall 2022 Schedule