Time: 4:00-5:15pm on Tuesdays and Thursdays.
Location: Hasbrouck Laboratory, Room 134.
Slides and Reading by Lecture: See here.
Instructor: Mingda Qiao
Email: mqiao at umass dot edu
Office: CS Building 232.
Office hours: 2:30-3:30pm on Thursdays in CS 232.
Contact: If you need to chat or schedule an individual meeting, you can reach out via email, via a Piazza message, or in person, after class or during office hours.
Teaching Assistants:
Shiv Shankar
Email: sshankar at umass dot edu.
Office hours: Monday 11am-noon in CS 207 and Monday 4-5pm over Zoom.
Shreyas Chaudhari
Email: schaudhari at umass dot edu.
Office hours: Friday 11am-noon in CS 207 and Tuesday 2-3pm over Zoom.
Jinlin Lai
Email: jinlinlai at umass dot edu.
Office hours: Wednesday 1-2pm in CS 207 and Thursday 7-8pm over Zoom.
Online Sections: Andrew McGregor is teaching the online sections of COMPSCI 514 with asynchronous lectures that will closely parallel this section, but may diverge a bit as the semester progresses. Problem sets and exams will largely be coordinated across the two classes.
Course Description: With the advent of social networks, ubiquitous sensors, and large-scale computational science, data scientists must deal with data that are massive in size, arrive at blinding speeds, and often must be processed within interactive or quasi-interactive time frames. This course studies the mathematical foundations of big data processing, developing algorithms and learning how to analyze them. We explore methods for sampling, sketching, and distributed processing of large scale databases, graphs, and data streams for purposes of scalable statistical description, querying, pattern mining, and learning. 3 credits.
Prerequisites: The undergraduate prerequisites are COMPSCI 240 or STAT 515 (Probability) and COMPSCI 311 (Algorithms). This is a theoretical course with an emphasis on algorithm design, correctness proofs, and analysis. Aside from a general background in algorithms, a strong mathematical background, particularly in linear algebra and probability is required. If you are a masters student with a limited background in either of these subjects, please email me at the start of the semester to discuss your preparation.
Textbooks: This is no official textbook for this class. We will use some material from the following books:
Foundations of Data Science, Avrim Blum, John Hopcroft and Ravi Kannan.
Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman and Jeff Ullman.
Probability and Computing, Michael Mitzenmacher and Eli Upfal.
Readings from these books and other sources will be posted before class.
Related Classes: You may also find some helpful reference material in these similar classes:
The Modern Algorithmic Toolbox, taught by Gregory Valiant at Stanford.
Sketching Algorithms for Big Data, taught by Piotr Indyk and Jelani Nelson at MIT and Harvard.
COMPSCI 514 last year (Fall 2024), taught by Cameron Musco.
Learning Objectives:
Students will learn about modern tools for data processing, including random sampling and hashing, low-memory streaming algorithms, linear and non-linear dimensionality reduction, spectral graph theory, and continuous optimization. A major goal is to be familiar at a high level with a breadth of algorithmic tools beyond combinatorial algorithms, which are the main focus of most undergraduate algorithms courses.
Through problem sets, students will develop the ability to apply and modify these algorithmic tools to tackle new problems, beyond those discussed in class. They will strengthen their ability to think creatively about algorithmic problems and push beyond known approaches, to develop solutions of their own.
Through assessments that emphasize formal proofs, students will strengthen their ability to formulate problems mathematically and analyze them rigorously.
Through algorithmic problems, students will practice applying fundamental tools in probability theory and linear algebra, which are broadly applicable in data science and machine learning. These include concentration bounds and methods for decomposing complex random variables, eigendecomposition, orthogonal projection, important matrix identities, and fundamentals of high-dimensional geometry and random matrix theory.
Piazza: We will use Piazza for class discussion, questions, and announcements. Sign up at this link. We hope for Piazza to be a key interactive component of the class. Thus, we encourage posting and good answering of other students' questions as part of up to 5% extra credit for class participation (see below).
Grade Components:
Problem Sets (5 total): 40%, split between core competency problems and challenge problems, see details below.
Weekly Quizzes: 10%, weighted equally, lowest score dropped.
Midterm exam: 25%.
Final exam: 25%.
Participation: Up to 5% extra credit (see below).
Class Participation: Up to 5% extra credit may be awarded for class participation. This may come in many forms, e.g.:
Asking good clarifying questions and answering questions during lecture.
Asking good clarifying questions and answering other students' or instructor questions on Piazza.
Posting helpful links on Piazza, e.g., resources that cover class material, research articles related to the topics covered in class, etc.
Grade Scale: The course is graded on a standard scale. This scale might be shifted down to account for any difficult exams/problems sets, but will never be shifted up. That is, if you obtain a 90% in the course, you will definitely achieve an A-, and potentially an A. If applicable, I will publish the shifted scale at the conclusion of the course. The standard grade scale is: A: ≥ 93, A-: [90, 93), B+: [87, 90), B: [83, 87), B-: [80, 83), C+: [77, 80), C: [73, 77), C-: [70, 73), D+: [67, 70), D: [63, 67), F: < 63.
Problem Sets: The problem sets will be split into two components: core competency questions and challenge questions.
Core competency questions are designed to help you master the key algorithmic and mathematical tools introduced in the course. They will be similar in difficulty to exam questions, and if you are able to solve them, you should be well prepared for the in-class exams.
You are expected to complete all core competency questions.
Challenge questions are designed to strengthen your ability to think creatively about algorithmic problems and push beyond known approaches, to develop solutions of your own. They will require significantly more time to digest and solve than core competency questions.
Each problem set will contain roughly two to three challenge questions, and you may chose which ones to complete. For each problem set, we will specify a number (typically one) that you must complete to get full credit. We will not generally assign extra credit for completing more than the required number.
Problem Set Submissions: Problem sets can be completed in groups of up to three students. If you work in a group, you submit a single problem set together. You may talk to people not in your group about the problem sets at a high level, but may not work through the detailed solutions together, write them up together, etc. We very strongly encourage you to work in a three person group, as it will give an advantage in doing the problem sets. At the beginning of the semester we will make a Piazza post where you can look for teammates.
While we encourage working in groups, we expect all members of a group to collaborate on and understand their submitted solutions. Some exam problems may closely resemble previous homework problems, and so understanding their solutions will be critical to your success in the course.
Problem set submissions will be via Gradescope. If working in a group, only one member of each group should submit the problem set, marking the other members in the group as part of the submission in Gradescope.
The entry code for Gradescope is WJ6RZ6.
No late homework submissions will be accepted unless there are extenuating circumstances, approved by the instructor before the deadline.
I strongly encourage students to type up problem sets using Latex. A Latex template for problem sets can be downloaded [TBA]. While it may seem cumbersome at first, getting proficient in Latex will save you a lot of time in the long run!
Weekly Quizzes: A quiz will be posted on Canvas each Thursday after class, due the following Monday at 8pm. These are short quizzes (designed to take ~15 minutes) to check that you are following the material and help me make adjustments if needed. Quizzes will include check-in questions asking for feedback on class pacing and on topics that need clarification, or that you would like to see discussed more. While we will not allow any excused misses of quizzes, the lowest quiz grade will be dropped, so that each student can miss one quiz during the course of the semester without it affecting their grade.
Exams: There will be a midterm exam and a final exam. The midterm will be held on Tuesday October 14th in the evening (7-9pm). The date, time and location of the final are TBA. Both exams will be closed notes. We will post extensive review material, past exams, and other practice questions to help you prepare. There is no option to take the exams remotely. Any makeup exams needed due to illness or other excused absences will be held in person.
Course Academic Honestly Policy: If caught violating the problem set or quiz rules, students will receive a 0% on the assignment for the first violation, and fail the class for a second violation. Any cheating on the midterm or final will lead to failing the class. For fairness, we apply these rules universally, without exceptions.
Generative AI Policy: Generative AI tools (such as ChatGPT, Copilot, and similar tools) are viewed as helpful but imperfect tutors who are not in your problem set group. You may not use them for the generation of submitted work including quizzes, problem sets, and exams. You may ask general questions related to the course material to get pointers, intuition, and insights. See here for the University statement on Responsible Use of Generative AI.
Academic Integrity Statement: UMass Amherst is strongly committed to academic integrity, which is defined as completing all academic work without cheating, lying, stealing, or receiving unauthorized assistance from any other person, or using any source of information not appropriately authorized or attributed. As a community, we hold each other accountable and support each other’s knowledge and understanding of academic integrity. Academic dishonesty is prohibited in all programs of the University and includes but is not limited to: Cheating, fabrication, plagiarism, lying, and facilitating dishonesty, via analogue and digital means. Sanctions may be imposed on any student who has committed or participated in an academic integrity infraction. Any person who has reason to believe that a student has committed an academic integrity infraction should bring such information to the attention of the appropriate course instructor as soon as possible. All students at the University of Massachusetts Amherst have read and acknowledged the Commitment to Academic Integrity and are knowingly responsible for completing all work with integrity and in accordance with the policy: https://www.umass.edu/senate/book/academic-regulations-academic-integrity-policy
Disability Accommodations: The University of Massachusetts Amherst is committed to providing an equal educational opportunity for all students. If you have a documented physical, psychological, or learning disability on file with Disability Services (DS), you may be eligible for reasonable academic accommodations to help you succeed in this course. If you have a documented disability that requires an accommodation, please notify me within the first two weeks of the semester so that we may make appropriate arrangements.
I understand that people have different learning needs, home situations, etc. If something isn’t working for you in the class, please reach out and let’s try to work it out.
Links to Other Helpful UMass Resources: