Big Data Analytics
Logistics
Instructor: Prof. Zachary Ives
Classroom: Fagin Auditorium. Lectures will be recorded and posted, but in-person attendance is highly encouraged.
Lectures will be Mondays and Wednesdays, 1:45pm - 3:15pm in Fagin Auditorium (School of Nursing -- cut through Johnson Pavilion from campus).
Prerecorded video lectures will also be made available, linked through the syllabus.
Recitations will be on Fridays, in the same time slot and location. Materials will be posted weekly in Ed Discussion. Recitations are optional but highly recommended.
(Previous iterations of the course: Spring 2023, Fall 2022, Spring 2022)
Course Description
In the era of big data, we are increasingly faced with the challenges of converting massive amounts of data to actionable knowledge. Given the limits of individual machines (compute power, memory, bandwidth), increasingly the solution is to clean, integrate, and process the data using statistical machine learning techniques, in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.
Prerequisites
This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 1100, MCIT 5900, or the equivalent is required. Additional background in statistics or data analysis (e.g., in Matlab or R) is helpful. For CIS, ESE, and Data Science students, CIS 5450 is a very appropriate course to take before CIS 5190 or 5200, although the courses can be sequenced in any order.
Grading
The grade breakdown will be as follows:
homeworks (5-6 expected) 40%,
term project 20%,
midterm 15%,
final exam (2nd midterm) 15%,
quizzes (due within 1 week of the lecture!) 7%,
participation (attending in person, posting to Ed) 3%.
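As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component is scored on a 0-100 scale and that the final score is a plain weighted average. The names `WEIGHTS` and `course_score` are hypothetical, and the actual grade computation (curving, rounding, letter-grade cutoffs) is not specified here.

```python
# Component weights in percent, copied from the grade breakdown above.
WEIGHTS = {
    "homeworks": 40,
    "term_project": 20,
    "midterm": 15,
    "final_exam": 15,
    "quizzes": 7,
    "participation": 3,
}


def course_score(scores):
    """Weighted average of component scores (each assumed to be 0-100).

    This is an assumed formula for illustration, not the official one.
    """
    assert sum(WEIGHTS.values()) == 100, "weights should total 100%"
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


example = {
    "homeworks": 90,
    "term_project": 80,
    "midterm": 70,
    "final_exam": 60,
    "quizzes": 100,
    "participation": 100,
}
print(course_score(example))
```

Under these assumptions, homeworks dominate the result: a 10-point change in the homework average moves the overall score by 4 points.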
Late Days:
Students can submit homework assignments (including HW0, but not the term project) up to 48 hours late cumulatively, with no penalty. In other words, you have 48 late hours to split across all homeworks. If you exceed those 48 hours, a penalty of 1% per hour will be applied to that HW. Please note that late time is rounded up (e.g., Gradescope will count a submission that is even 1 minute late as using up 1 late hour).
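The late-day arithmetic can be sketched in Python as follows. This is one reading of the policy, not the official Gradescope logic: it assumes late hours are drawn from the shared 48-hour budget in submission order, and that only the hours beyond the remaining budget are penalized, at 1 percentage point each. The function names are hypothetical.

```python
import math

LATE_BUDGET_HOURS = 48   # shared across all homeworks, per the policy
PENALTY_PER_HOUR = 1.0   # percentage points per late hour beyond the budget


def late_hours(minutes_late):
    """Round lateness up to whole hours (1 minute late counts as 1 hour)."""
    return math.ceil(minutes_late / 60) if minutes_late > 0 else 0


def apply_late_policy(submissions, budget=LATE_BUDGET_HOURS):
    """Compute per-homework penalties.

    submissions: list of (name, minutes_late) in submission order.
    Returns (penalties, remaining_budget), where penalties maps each
    homework name to the percentage points deducted (assumed rule).
    """
    penalties = {}
    remaining = budget
    for name, minutes in submissions:
        hours = late_hours(minutes)
        free = min(hours, remaining)   # hours covered by the budget
        remaining -= free
        penalties[name] = (hours - free) * PENALTY_PER_HOUR
    return penalties, remaining


# HW1 is 1 minute late, HW2 is 30 hours late, HW3 is 20 hours late.
penalties, remaining = apply_late_policy(
    [("HW1", 1), ("HW2", 30 * 60), ("HW3", 20 * 60)]
)
print(penalties, remaining)
```

In this example HW1 uses 1 late hour and HW2 uses 30, leaving 17. HW3 needs 20, so 3 hours fall outside the budget and, under the assumed rule, HW3 loses 3 percentage points.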
Collaboration Policy:
You are responsible for knowing Penn's Code of Academic Integrity. In particular, copying solutions from other students or other resources (e.g., the Web or students who have taken the class in previous years) is NOT allowed. While you can verbally discuss high-level ideas and concepts, you are NOT allowed to share code with each other. Needless to say, making answers to homework assignments or exams available to others, either directly or by posting on the web, is NOT allowed.
We will not have a sense of humor about violations of this policy!
AI/Large Language Model (LLM) Policy:
Modern AI tools can be of great help in understanding concepts, and we have no concerns about you using ChatGPT, Bard, etc. to get alternative explanations for topics. However -- given that we are trying to teach general, reusable skills -- we expect you to write your code without help from an LLM or from a classmate. Please note that the exams will be tailored with this in mind (not focused on syntactic details, but on the ability to tackle problems) so you should make sure you can solve problems on your own!
Readings and Resources
Colab (for homeworks; you'll need a Google@SEAS or Gmail ID)
Ed Discussion (questions, discussion)
Canvas (for access to the lecture recordings, which will be linked below)
Gradescope (for homework submission and exams; you'll be auto-added to this via Canvas)
OHQ (for less frustrating office hour queuing)
Please contact your staff in case of any trouble accessing these resources!
Readings:
We recommend several books for students of different skill levels.
For students who do not have at least 2 years of a CS degree: You should get the book Data Science from Scratch, 2nd ed, by Grus, from O'Reilly. This book provides a quick refresher in Python, probability, statistics, and linear algebra. An online version can be accessed through the Penn libraries.
For all students: Python for Data Analysis, by McKinney, from O'Reilly. Again, an online version is accessible via the Penn libraries.
For advanced students: Python Machine Learning, 3rd edition, by Raschka, from Packt. An online version of this book is also accessible via the Penn libraries.
If you are new to Python and data science, you may find the UC Berkeley free book The Foundations of Data Science useful.
Schedule
(Subject to revision)