Big Data Analytics

Logistics

Instructor: Prof. Zachary Ives

Your fantastic TAs: Arnav Jhaveri (Head TA), Jeffrey Li (Head TA), Tahmid Ahamed, Liang-Yun Cheng, Federico Cimini, Yash Nakadi, Emily Liu, Arush Mehrotra, Ben Chan, Joseph Lee, Michael Lu, Aashvi Manakiwala, Karan Sampath, Akanksha Tripathy, Sharan Venkatesh, Nicky Wongchamcharoen 

Classroom: Fagin Auditorium.  Lectures will be recorded and posted, but in-person attendance is highly encouraged.

(Previous iterations of the course: Spring 2023, Fall 2022, Spring 2022)

Course Description

In the era of big data, we are increasingly faced with the challenges of converting massive amounts of data to actionable knowledge. Given the limits of individual machines (compute power, memory, bandwidth), increasingly the solution is to clean, integrate, and process the data using statistical machine learning techniques, in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.

Prerequisites

This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 1100, MCIT 5900, or the equivalent is required. Additional background in statistics and data analysis (e.g., in Matlab or R) is helpful. For CIS, ESE, and Data Science students, CIS 5450 is very appropriate as a course before CIS 5190 or 5200, although the courses can be sequenced in any order.

Grading

The grade breakdown will be as follows:  

Late Days:

Students can submit homework assignments (including HW0, but not the term project) up to 48 hours late cumulatively with no penalty.  In other words, you have 48 late hours to split across all homeworks. If you exceed those 48 hours, a penalty of 1% per hour will be applied to that HW. Please note that we round up (e.g., Gradescope will count a submission that is even 1 minute late as using up 1 late hour).
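
To make the arithmetic of the late-day policy above concrete, here is a minimal Python sketch. It is only an illustration, not how the course staff or Gradescope actually compute penalties. The function names are hypothetical, and it assumes that homeworks draw on the 48-hour budget in submission order and that each hour beyond the remaining budget costs 1 percentage point on that homework.

```python
import math

LATE_BUDGET_HOURS = 48   # cumulative penalty-free late hours (from the policy)
PENALTY_PER_HOUR = 1.0   # percentage points per late hour beyond the budget


def late_hours(minutes_late):
    """Round lateness up to whole hours: 1 minute late counts as 1 hour."""
    if minutes_late <= 0:
        return 0
    return math.ceil(minutes_late / 60)


def apply_late_policy(minutes_late_per_hw):
    """Return (penalty per homework in percentage points, budget hours left).

    Assumption: homeworks consume the shared budget in the order given.
    """
    remaining = LATE_BUDGET_HOURS
    penalties = []
    for minutes in minutes_late_per_hw:
        hours = late_hours(minutes)
        free = min(hours, remaining)      # hours covered by the budget
        remaining -= free
        penalties.append((hours - free) * PENALTY_PER_HOUR)
    return penalties, remaining


# Four homeworks: 1 minute late, on time, 30 hours late, 20 hours late.
penalties, remaining = apply_late_policy([1, 0, 30 * 60, 20 * 60])
print(penalties, remaining)
```

In this example the first three homeworks use 1 + 0 + 30 = 31 of the 48 hours, so only 17 free hours remain for the last one, and its other 3 hours are penalized under the sketch's assumptions.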

Collaboration Policy:

You are responsible for knowing Penn's Code of Academic Integrity.  In particular, copying solutions from other students or other resources (e.g., the Web or students who have taken the class in previous years) is NOT allowed.  While you can verbally discuss high-level ideas and concepts, you are NOT allowed to share code with each other. Needless to say, making answers to homework assignments or exams available to others, either directly or by posting on the web, is NOT allowed.

We will not have a sense of humor about violations of this policy! 

AI/Large Language Model (LLM) Policy:

Modern AI tools can be of great help in understanding concepts, and we have no concerns about you using ChatGPT, Bard, etc. to get alternative explanations for topics.  However -- given that we are trying to teach general, reusable skills -- we expect you to write your code without help from an LLM or from a classmate.  Please note that the exams will be tailored with this in mind (not focused on syntactic details, but on the ability to tackle problems) so you should make sure you can solve problems on your own!

Readings and Resources

Readings:

We recommend several books for students of different skill levels.

For students who do not have at least 2 years of a CS degree: You should get the book Data Science from Scratch, 2nd ed, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed through the Penn libraries.

For all students: Python for Data Analysis, by McKinney, from O'Reilly.  Again, an online version is accessible via the Penn libraries.

For advanced students: Python Machine Learning, 3rd edition by Raschka, from Packt. And indeed, an online version of this book is also accessible via the Penn libraries.

If you are new to Python and data science, you may find the UC Berkeley free book The Foundations of Data Science useful.

Office Hours

(Please see here to meet the staff!)

Office hours will include a mix of in-person and Zoom meetings.  We will use OHQ to queue up.

Schedule

(Subject to revision)

CIS 5450 Fall 2023 Schedule