Expose students to the underlying principles of working with large data sets. In particular, distributed data processing and computing with partial views of data. Students will gain an understanding of the tradeoffs integral to analyzing big data. This will be illustrated with assignments that apply machine learning algorithms to large data sets. Students will gain a working knowledge of technologies such as Hadoop, Spark, and H20. Finally, they will get an insight into their implementation.
Students should have take the NEU graduate algorithms class. They should have experience programming in Java and be able to use tools such as source code revision systems, etc. There is also an expectation that students enjoy programming and strive to write beautiful code.
The course has weekly programming assignments and a capstone project. Work will be done in groups of up to four. Each group assignment, will be reviewed by another group and a random subset of assignments will be codewalked live in class. Students are expected to study the reading material ahead of class and ask questions on the course discussion forum. The capstone project will involve writing a significant piece of software and a report. The expected workload is 15 hours a week but may vary accordingly to skill level and difficulty of the assignment.
The course grade will consist of a class participation component (25%), a forum help component (15%), a group project report (30%), an individual codewalk (30%) along with some extra credit. The grading scale is A+ = 95%, A = 90%, A- = 85%, B+ = 80%, B = 75%, B- = 70%, C = 65%, D = 55%, F. The participation grade will be based on answering questions in class. The forum help grade is based on making helpful comments on Piazza. The report is a document describing the software project and evaluating it rigorously. Individual codewalks will be conducted during the last two weeks of the course and will consist of short Q&A session focusing on the code authored by the student. The grade will be based on the nature of the code and the student's answers.
Every file, class, and method should have a comment describing authorship of the code. This should include a single initial author (or owner), and possibly multiple maintainers. It is acceptable to use code from another group as long as that code is accurately attributed to its owner. Deleting or changing authorship comments is cheating.
Copying code without attribution is cheating. Cheating means an F to all involved for the course.
This class follows a flipped classroom model. There will be few traditional lectures, instead we will be working together either looking at code or discussing designs. Online learning material will be provided as well as handouts and research articles. Students are expected to attend class and should notify the instructor in advance of any absences. Each class will have a scribe whose role will be to record the discussion in a shared document.
Group A meets in Sneel 308, on Ws from 2:50 to 4:30.
Group B meets in Behrakis 220 on Ws from 6:00 to 7:40.
Students are expected to read [HTDG] as well as the papers assigned on weekly basis.
[HTDG, "Hadoop The Definitive Guide", by Tim White, O'Reilly]
[FJ10, "FlumeJava: Easy, Efficient Data-Parallel Pipelines", Chambers et al., PLDI'10, PDF]
[SK12, "Possible Hadoop Trajectories" By Stonebraker, Kepner, CACM'12 Text]
[S14, "Hadoop at a Crossroads?" By Michael Stonebraker, CACM'14 Text]
[M3R12, "Increased Performance for In-Memory Hadoop Jobs", by Shinnar et al., VLDB'12 PDF][RDD12, "A Fault-Tolerant Abstraction for In-Memory Cluster Comp." Zaharia e.a., NSDI'12 PDF]
[S14, "Hadoop at a Crossroads?" By Michael Stonebraker, CACM'14 Text]
[M3R12, "Increased Performance for In-Memory Hadoop Jobs", by Shinnar et al., VLDB'12 PDF]
[VLDB12, "The Performance of MapReduce: An In-depth Study", by Jiang et al., VLDB'10, PDF][MAS11, "Evaluating MapReduce Performance Using Workload Suites", Chen e.a., MASCOTS'11 PDF]
[OS06, "Bigtable: A Distributed Storage System for Structured Data", Chang et al., OSDI'06 PDF]
|1/14||Hadoop introduction (self-study)|
Read: HTDG Ch 1-2
|1/21||Working with big data|
Read: HTDG Ch 5, 6, 7
| 1/28||SNOWed out|
Read: HTDG Ch 5, 6, 7 (read again, in depth)
|2/04||Benchmarking and performance measurements|
Read: HTDG Ch 5, 6, 7, 8, 16
Due: [A2] [A1 reviews]
|2/18||Due: [A2 reviews]|
| 3/04||Read: SK12, S14|
Due: [A3 reviews]
|3/18||Read: RDD12, M3R12|
Open office week: each team should drop by my office to discuss status and conduct any remaining class participation Q&A.
|4/15||Project reports due by email 4/19|
Due: [Project reviews]
- [A0] M/R Intro: Given a file with two columns, one is a product category and the other is a price, write a map/reduce task that outputs the median of purchases in each category. (Individual project)
- [A1] Median Redux: Starting from your implementation of A0, we will look at performance of MR. As input, your program should accept a file with tab separated columns. The fourth column contains free form text describing an item category (for instance, "Children's toys") and the fifth a price in dollars and cents (for instance, 12.33). Your program should compute the median of purchases by item category, and it should be robust to malformed inputs. Write at least the following versions of your code. (v1) Java sequential, (v2) Map Reduce (sort in Reduce), (v3) Map Reduce (using composite keys to let MR do the sort), (v4) Map Reduce (either form) where each call to map also computes Fibbonacci of N. If you are inspired, you can also try a plain Java parallel version. Your goal here is to explore performance of this simple application, play with settings of MapReduce and with the way you write the code. Write a report that summarizes your results. The report should answer the following questions: (Q1) What is the performance difference between v1, v2 and v3 where you run v2 and v3 both in a single Hadoop process and "pseudo" distributed processes. (Q2) Comparing v(2|3) with v4, what is the largest value of N that does not affect performance? (Q3) How many instances of the reducer are running? The input file is (here 1.7GB txt file). Your submission should include: source code, a README file describing how to use. Compiling, building the jar and running should be automated. Try to make sure that your script does not have too many dependencies on you file structure so that reviewers can test your work. LaTeX is recommended for document preparation and R for any graphing tasks. Submission: email homework to firstname.lastname@example.org, with the following information: a) Subject: HW number, [A1], b) Naming attachment: as LastName_FirstName.zip. For eg: Vitek_Jan.zip. c) Format: send files zipped. Any doubts regarding this on the piazza. As with other courses, if there are multiple submissions, the most recent one would be taken for code review. (Individual project)
- [Codereview 1] You shall prepare a code review document, due next Tuesday 6PM. The document shall be a PDF file that you will email to the author of the code and our TA. A code review document has three parts: (A) Report, (B) Code and (C) Packaging. For part (A), you should look at the report and see if the report give sufficient information for you to understand what was achieved. Is the report clear? Is it complete? Does describe the right things? Are the graphs well done? For part (B), you should read the code an provide comments on code quality. Is the code commented? Is it well structured? Is there code repetition? Are there obvious bugs or performance issues? Is the code robust to exceptional conditions? For part (C), you should comment about the build files and run scripts. Was it easy for you to run the code? Was it possible to replicate the numbers in the report? Was all the data and code provided?
- [A2] Median^3: One opportunity to clean up your code from A1, and add v4, a Hadoop version that computes an approximate median, the goal is to avoid shuffling all the data to the reducer. This will require changing the data sent to map, and possibly experimenting with combiners. We will pay attention to code quality and to rigorous measurement of performance. The command line arguments for v4 can (but don't need to) include a sample rate and a bin size. Where the sample rate tells your code how much of the data set to process, and the bin size may be used to limit in how many groups of prices you want to aggregate your data. These are only hints, do with them what you will. Measure the performance of your code on AWS as well as your local machine. Submission: same procedure as A1. Due: Next Wednesday at noon. (Groups of two)
- [Codereview 2] For this code review you'll work in pairs (the same as A2) to review some other pair's A2. Your goal is to reproduce as much as possible. Be curious about the code, question the results. Don't waste space in your report on what works well -- spend your time on what could be improved. Avoid generalities (i.e. "Variables names could be better") instead be precise and give examples (i.e. "In function Foo, variables the names of variables, x43 and y7, do not communicate their role."). Submission: email homework to email@example.com, and to the author of the code you are reviewing. Due: 2/18 (Groups of two)
- [A3] Analyzing Airline Data: This assignment will have you analyze a large trove of airline data to detect some patterns. You will build a model and use that model to predict delays. This tarball contains sample data for you to start working, this is not the final input (that one will be much bigger). File "data.csv" contains airline data for one month. Use it to create a model. Then try to predict for each row of file "predict.csv" if the plane will be delayed by at least 15 minutes. Finally, you can check you predictions with the contents of the file "check.csv". Real data for the assignment is all.csv, check.csv, predict.csv. The goal of the assignment is to automate the model construction in Hadoop so that if you get a different input you can create a model and predict results with it. You should submit one Java program which recognizes the following command line switches: -learn, -predict, and -check. When called with -learn, the program will take the name of a file and run a M/R process that will analyze that file output a file called "model.m" (the contents of which are up to you), when your program is run with -predict, it takes two arguments, the "model.m" file and a file with airline information without actual times, it will output the same airline file back with the row ArrDel15 filed with your prediction. Lastly when you program is given the -check option it takes an airline file with predictions and a file with actual values and outputs the number of correctly predicted delays and the number of incorrectly predicted delays. Due: 2/25. (Groups of four)
A code review will be conducted after each task. Each team will review two other teams' work. The review will comment on code quality, design, documentation, provided tests, etc. The goal of the code review is to produce actionable suggestions for improvements. The code review will be summarized textually and discussed in class.
Students will implement their own MapReduce framework from first principles. This framework will be similar to Hadoop but use a streamlined API and strive for minimality.
100 points of extra credit are available for students particularly helpful on the discussion forum. Extra credit will be awarded to student producing artifacts used by other teams.
The instructor is available on Wednesdays only. The class has no TA. Students are expected to be self-sufficient and to help each other.