Instructor: Prof. Zachary Ives, office hours Mondays, 4:30pm - 5:30pm
Your fantastic TAs: Ben Chan, Vivian Xiao, Aeshon Balasubramanian, Pedram Bayat, Caroline Chen, Xiang Chen, Praj Chirathivat, Jackson Gold, Millie Gu, Antonios Kreizis, Matthew Kuo, Pak Kanjanakosit, Hannah Lu, Zora Mardjoko, Allison Mi, Vinay Padegal, Hassan Rizwan, Henry Sims, Steven Su, Seth Sukboontip, Term Taepaisitphongse, Aashika Vishwanath, Brandon Yan, Megan Yang, Tommy Yu, Allan Zhang
Classroom: Meyerson B1. Lectures will be recorded, but in-person attendance is expected.
Lectures on Mondays and Wednesdays, 1:45pm - 3:15pm in Meyerson B1
Recitations on Fridays, 1:45pm - 3:15pm in Meyerson B1
In the era of big data, we increasingly face the challenge of converting massive amounts of data into actionable knowledge. Given the limits of individual machines (compute power, memory, bandwidth), the solution is often to clean, integrate, and process the data using statistical machine learning techniques, in parallel on many machines. This course focuses on the fundamentals of scaling computation to handle common data analytics tasks. You will learn about basic tasks in collecting, wrangling, and structuring data; programming models for performing certain kinds of computation in a scalable way across many compute nodes; common approaches to converting algorithms to such programming models; standard toolkits for data analysis consisting of a wide variety of primitives; and popular distributed frameworks for analytics tasks such as filtering, graph analysis, clustering, and classification.
This course expects broad familiarity with probability and statistics, as well as programming in Python. CIS 1100, MCIT 5900, or the equivalent is required. Additional background in statistics or data analysis (e.g., in Matlab or R) is helpful. For CIS, ESE, and Data Science students, CIS 5450 is very appropriate as a course before CIS 5190 or 5200, although the courses can be taken in any order.
The grade breakdown will be as follows:
homeworks (5-6 expected) 40%,
term project 20%,
midterm 15%,
final exam (2nd midterm) 15%,
in-class and after-class quizzes and detailed exercises 7%,
participation (participating in person, posting to Ed) 3%.
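As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each component has already been reduced to a single score on a 0-100 scale; the component names and the `course_score` function are hypothetical, and the syllabus does not specify how individual homeworks are averaged or how the final percentage maps to a letter grade.

```python
# Weights taken from the grade breakdown above, in percentage points.
# The dictionary keys are illustrative names, not official identifiers.
WEIGHTS = {
    "homeworks": 40,
    "term_project": 20,
    "midterm": 15,
    "final_exam": 15,
    "quizzes_and_exercises": 7,
    "participation": 3,
}


def course_score(component_scores):
    """Return the weighted overall score (0-100).

    Assumes every component score is already on a 0-100 scale.
    """
    # The weights should account for the whole grade.
    assert sum(WEIGHTS.values()) == 100
    total = sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
    return total / 100


example = {
    "homeworks": 90,
    "term_project": 80,
    "midterm": 70,
    "final_exam": 60,
    "quizzes_and_exercises": 100,
    "participation": 100,
}
print(course_score(example))  # 81.5
```

In this example the homeworks contribute 36 points, the term project 16, the two exams 10.5 and 9, and the remaining components 7 and 3, for a total of 81.5.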
Students can submit homework assignments (including HW0 but not the term project) up to 48 hours late cumulatively, with no penalty. In other words, you have 48 late hours in total, split across all homeworks. If you exceed those 48 hours, a penalty of 1% per hour will be applied to that HW. Please note that late time is rounded up to the next full hour (e.g., Gradescope will count a submission that is even 1 minute late as using up 1 late hour).
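The late-hour arithmetic can be sketched in Python as follows. This is only one reading of the policy, under stated assumptions: homeworks are processed in submission order, free late hours are consumed before any penalty applies, and the penalty is expressed in percentage points for the homework that exceeds the budget. The function names are hypothetical, and the course staff's actual bookkeeping may differ.

```python
import math

LATE_BUDGET_HOURS = 48   # total free late hours across all homeworks
PENALTY_PER_HOUR = 1.0   # percentage points per late hour beyond the budget


def late_hours_used(minutes_late):
    """Round lateness up to whole hours; 1 minute late counts as 1 hour."""
    if minutes_late <= 0:
        return 0
    return math.ceil(minutes_late / 60)


def apply_late_policy(minutes_late_per_hw, budget=LATE_BUDGET_HOURS):
    """Return (penalty per homework, remaining free late hours).

    Assumption: homeworks are listed in submission order and free hours
    are spent first; only hours beyond the budget are penalized.
    """
    remaining = budget
    penalties = []
    for minutes in minutes_late_per_hw:
        hours = late_hours_used(minutes)
        free = min(hours, remaining)
        remaining -= free
        penalties.append((hours - free) * PENALTY_PER_HOUR)
    return penalties, remaining


# 1 minute late, on time, 47.5 hours late, then 90 minutes late.
penalties, remaining = apply_late_policy([1, 0, 47 * 60 + 30, 90])
print(penalties, remaining)  # [0.0, 0.0, 1.0, 2.0] 0
```

Here the first submission uses 1 late hour, the third rounds up to 48 hours but only 47 free hours remain (a 1% penalty), and the fourth rounds up to 2 hours with no budget left (a 2% penalty).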
You are responsible for knowing Penn's Code of Academic Integrity. In particular, copying solutions from other students or other resources (e.g., the Web or students who have taken the class in previous years) is NOT allowed. While you can verbally discuss high-level ideas and concepts, you are NOT allowed to share code with each other. Needless to say, making answers to homework assignments or exams available to others, either directly or by posting on the web, is NOT allowed.
We will not have a sense of humor about violations of this policy!
Modern AI tools can be of great help in understanding concepts, and we have no concerns about you using ChatGPT, Bard, etc. to get alternative explanations for topics. However, given that we are trying to teach general, reusable skills, we expect you to write your code without help from an LLM or from a classmate. Please note that the exams will be tailored with this in mind (focused not on syntactic details but on the ability to tackle problems, including those that appear in the homework), so you should make sure you can solve problems on your own without AI help!
Colab (for homeworks; you'll need a Google@SEAS or Gmail ID)
Ed Discussion (questions, discussion)
Canvas (for access to the lecture recordings, which will be linked below)
Gradescope (for homework submission and exams; you'll be auto-added to this via Canvas)
OHQ (for less frustrating office hour queuing)
Please contact your staff in case of any trouble accessing these resources!
We recommend several supplementary books for students of different skill levels.
Python for Data Analysis, by McKinney, from O'Reilly. An online version is accessible via the Penn libraries.
Python Machine Learning, 3rd edition by Raschka, from Packt. Again, an online version of this book is also accessible via the Penn libraries.
If you are new to Python and data science, you may find the UC Berkeley free book The Foundations of Data Science useful.
(Subject to revision!)