Big Data Analytics

Logistics

Instructor: Prof. Zachary Ives

Your fantastic TAs: Boom Devahastin Na Ayudhya, Phillip Chau, Kavish Shah, Liang-Yun Cheng, Federico Cimini, Vatsal Jain, Karan Jaisingh, Arnav Jhaveri, Jeffrey Li, Yixuan Meng, Bhairavi Muralidharan, Yash Nakadi, Sara Nayak, Thiti (Nob) Premrudeepreechacharn, Aishwarya Ramanath, Karan Sampath, Parth Sheth, Akanksha Tripathy, Shreyans Tiwari, Nicky Wongchamcharoen

Classroom: Myerson Hall Room B1. Lectures will also be streamed and recorded via Zoom, but classroom attendance is highly encouraged!

(Previous iterations of the course: Fall 2022, Spring 2022)

Course Description

In the era of big data, we increasingly face the challenge of converting massive amounts of data into actionable knowledge. Given the limits of individual machines (compute power, memory, bandwidth), the usual solution is to clean, integrate, and process the data in parallel on many machines, using statistical machine learning techniques. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation scalably across many compute nodes; common approaches to converting algorithms to such programming models; standard data analysis toolkits that offer a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.

Prerequisites

This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 1100, MCIT 5900, or the equivalent is required. Additional background in statistics or data analysis (e.g., in MATLAB or R) is helpful. For CIS, ESE, and Data Science students, CIS 5450 is a good course to take before CIS 5190 or 5200, although the courses can be sequenced in any order.

Grading

The grade breakdown will be as follows:  

Masking + COVID Policy:

This course will be held in person, and the expectation is that all students will participate in the classroom. Live and recorded video will be available for a limited window for students who are unable to join in person. For the safety of several students who have health concerns, we are requiring that all students wear a mask in the classroom and during in-person office hours.

Late Days:

Students can submit homework assignments (including HW0 but not the term project) up to 48 hours late cumulatively with no penalty. In other words, you have 48 late hours in total, split across all homeworks. If you exceed those 48 hours, a penalty of 1% per hour will be applied to that HW. Please note that we round up (e.g., Gradescope will count a submission that is even 1 minute late as using up 1 late hour).
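As an illustration only, the late-hour arithmetic above can be sketched in Python. This is not an official calculator: the function names are hypothetical, and the sketch assumes that free late hours are used up in submission order and that only the hours beyond the remaining budget are penalized on a given homework. When in doubt, ask the course staff.

```python
import math

LATE_BUDGET_HOURS = 48   # cumulative penalty-free late hours (from the policy)
PENALTY_PER_HOUR = 1.0   # percentage points deducted per late hour over budget


def hours_late(minutes_late):
    """Round lateness up to whole hours; 1 minute late counts as 1 hour."""
    return max(0, math.ceil(minutes_late / 60))


def apply_late_policy(minutes_late_per_hw):
    """Return (penalty percent for each homework, remaining free hours).

    Assumption (not stated in the policy): free hours are consumed in
    submission order, and only hours beyond the remaining budget are
    penalized on that homework.
    """
    remaining = LATE_BUDGET_HOURS
    penalties = []
    for minutes in minutes_late_per_hw:
        late = hours_late(minutes)
        free = min(late, remaining)   # hours covered by the budget
        remaining -= free
        penalties.append((late - free) * PENALTY_PER_HOUR)
    return penalties, remaining


# Three homeworks submitted 1 minute, 30 hours, and 20 hours 5 minutes late.
penalties, remaining = apply_late_policy([1, 30 * 60, 20 * 60 + 5])
print(penalties, remaining)
```

In this example the three submissions count as 1, 30, and 21 late hours. The first two fit within the 48-hour budget, and the third exceeds what is left by 4 hours, so under the stated assumption it would lose 4%.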

Collaboration Policy:

You are responsible for knowing Penn's Code of Academic Integrity. In particular, copying solutions from other students or other resources (e.g., the Web or students who have taken the class in previous years) is NOT allowed. While you can verbally discuss high-level ideas and concepts, you are NOT allowed to share code with each other. Needless to say, making answers to homework assignments or exams available to others, either directly or by posting them on the Web, is NOT allowed.

We will not have a sense of humor about violations of this policy! 

Readings and Resources

Readings:

We recommend several books for students of different skill levels.

For students who have completed less than two years of a CS degree: You should get the book Data Science from Scratch, 2nd edition, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed through the Penn libraries.

For all students: Python for Data Analysis, by McKinney, from O'Reilly.  Again, an online version is accessible via the Penn libraries.

For advanced students: Python Machine Learning, 3rd edition by Raschka, from Packt. And indeed, an online version of this book is also accessible via the Penn libraries.

If you are new to Python and data science, you may find the free UC Berkeley book The Foundations of Data Science useful.

Office Hours

(Please see here to meet the staff!)

Office hours will include a mix of in-person and Zoom meetings. We will use OHQ to manage the queue. Note that there are no office hours during Spring Break.

Schedule

(Subject to revision)

CIS 5450 Schedule, Spring 2023