Course Info

Big Data and Cloud Systems

4190.684

Fall 2014

Seoul National University



Announcements

    9/30:

    • Project poster session: 11AM - 1PM 12/15 (Mon)
    • Project final report due 6PM 12/13 (Sat)

    9/18:

    • Project proposal due 6PM 9/30 (Tue).
    • Programming assignment out 9/25; due 6PM 10/18 (Sat)

    9/4: 

    • MapReduce, Dryad paper reviews due 11:59AM, 9/17 (Wed)

    9/2: The paper reviews are due 11:59AM before the lecture.



    Goal

    This is a research-focused class on big data and cloud systems. The course has both a reading/lecture/discussion component and a project component. We will read recent research papers on big data and cloud systems from SOSP, OSDI, NSDI, EuroSys, USENIX ATC, SOCC, VLDB, SIGMOD, ICDE, etc. Students are expected to read the papers before class, submit a one-page summary of each paper, and participate in the discussion during class. A major portion of this course is a term project, whose goal is to investigate new ideas and solutions. Students are expected to form small groups of 2-3 people to work on a few selected areas of big data and cloud systems. The project requires a proposal, a project update (presented in class), and a final report (both written and presented).



    Time

    • MW 3:30PM-4:45PM

    Location

    • Bldg. 302, Rm. 106

    Staff

    • Byung-Gon Chun
    • Brian Cho


    Course materials

    • Recent research papers on big data and cloud systems

    Evaluation (TBD)

    • 20% - paper reviews (#total papers), paper presentation, participation 
      • paper review score = 0, if # paper reviews <= limitreviews
      • paper review score = max_score * (# paper reviews - limitreviews) / (#total - limitreviews), if limitreviews <= # paper reviews <= #total 
      • 1 paper presentation =~ 4 paper reviews
    • 20% - Programming assignment
    • 60% - Class project (project proposal, project presentation, project report)
      • Proposal - 5%
      • Mid check 1 (Nov. 5) - 10%
      • Mid check 2 (Nov. 26) - 10%
      • Final poster (Dec. 15) - 15% (missing demo: -5%)
      • Final report (Dec. 13) - 20%
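
    The paper review scoring rule above can be sketched in code. This is only an illustration: the function name, the example values for limitreviews, #total, and max_score, and the capping of the review count at #total are assumptions, not values announced for the course. The "1 paper presentation =~ 4 paper reviews" equivalence is not modeled here.

```python
def paper_review_score(num_reviews, limit_reviews, total_papers, max_score):
    """Piecewise review score, following the rule in the Evaluation list.

    All argument values are supplied by the caller; the course page does not
    state limitreviews or #total, so any numbers used here are hypothetical.
    """
    # Assumption: reviews beyond the total number of papers do not add credit.
    n = min(num_reviews, total_papers)
    # At or below the limit, the review score is 0.
    if n <= limit_reviews:
        return 0.0
    # Above the limit, the score grows linearly up to max_score at #total.
    return max_score * (n - limit_reviews) / (total_papers - limit_reviews)


if __name__ == "__main__":
    # Hypothetical numbers: 28 papers in total, limit of 4, max score of 20.
    for n in (4, 10, 28):
        print(n, paper_review_score(n, limit_reviews=4, total_papers=28, max_score=20.0))
```

    With these hypothetical numbers, a student at the limit scores 0, and a student who reviews every paper receives the full max_score.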

    Syllabus (Tentative Schedule)

    • Week 1: Introduction. Paper presentation sign-up. (2 papers)
    • Week 2: Thanksgiving days (no class) (make-up class)
    • Week 3: Guest lecture (2 papers)
      • M (9/15) - Guest lecture (Cosmos, Scope, PeriScope), Microsoft
      • W: MapReduce [Slides], Dryad [Slides]
    • Week 4: Programming assignment 9/25 (2 papers)
    • Week 5: (2 papers): Project proposal due 9/30
      • M - Pig [Slides], Hive [Slides]
      • W - "Proposal presentation"
    • Week 6: (0 papers) (OOT - NO class)
    • Week 7: (0 papers): Programming assignment due
      • M - Redhat talk
      • W - No meeting
    • Week 9: (4 papers)
      • M - SparkStreaming [Slides], TimeStream [Slides]
      • W - REEF tutorial Hello REEF Yarn [Slides]
    • Week 10: (4 papers)
      • M - REEF tutorial [Slides] [FAQ]
      • W - Pregel [Slides], Distributed GraphLab [Slides]
      • "Project progress check" (W 6:30PM)
    • Week 11: (4 papers)
    • Week 12: (4 papers)
      • M - LightWeight (SOSP 2013) [Slides], X-Stream (SOSP 2013) [Slides]
      • W - MillWheel [Slides], Distributed aggregation [Slides]
    • Week 13: (4 papers)
      • M - GFS [Slides], Bigtable [Slides]
      • W - Project progress check
    • Week 15: Final project report due Friday
      • M - Guest lecture, Big Data
      • W - Chubby [Slides], Spanner [Slides]
      • Other papers - Zookeeper, MegaStore
    • Week 16: Project poster session

      * TBD: One or two more guest lectures


      Paper reviews

      The goal of these reviews is to help you synthesize the main ideas and concepts presented in each paper. You have to write your reviews in English. The review text is limited to 250 words. 
      Email reviews in text to Brian and me (bgchun AT snu DOT ac DOT kr, chobrian AT gmail DOT com), with the subject line ``[BDCS] Review-$lecturedate$-$firstauthor$'' where $lecturedate$ is in the standard MMDDYY format and $firstauthor$ is the last name of the first author. Reviews are due by 11:59am before the lecture.
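
      As a convenience, the subject-line format and the word limit above can be checked with a short script. This is a minimal sketch: the function names are hypothetical, and counting words by splitting on whitespace is an assumption about how the 250-word limit is measured.

```python
from datetime import date


def review_subject(lecture_date, first_author_last_name):
    """Build the subject line '[BDCS] Review-MMDDYY-LastName'."""
    # %m%d%y gives zero-padded month, day, and two-digit year (MMDDYY).
    return "[BDCS] Review-{}-{}".format(
        lecture_date.strftime("%m%d%y"), first_author_last_name
    )


def within_word_limit(review_text, limit=250):
    """Return True if the review has at most `limit` whitespace-separated words."""
    return len(review_text.split()) <= limit


if __name__ == "__main__":
    # Example: a review for a paper whose first author is Dean,
    # discussed in the 9/17/2014 lecture.
    print(review_subject(date(2014, 9, 17), "Dean"))
    print(within_word_limit("This paper presents a programming model ..."))
```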

      References on how to write paper reviews


      Programming assignment

      • Exercise with REEF, Hadoop, other Big Data tools, and AWS/Azure.
      • Build an app (e.g., a simple machine learning algorithm) on REEF/Hadoop and deploy it on AWS/Azure => Production-quality code will be contributed to Apache


      Project

      • Project proposal (in English) (2 pages).
        • If you’re already doing research on related topics, you’re welcome to carve out your class project from your research. Your research topic should involve non-trivial system building; theory-oriented projects are not allowed. If you are new to this area, we can suggest potential class project topics for you. For REEF-related projects, we can give you more detailed feedback with help from Microsoft.
      • Project proposal feedback - face-to-face meetings (with Brian and me)
      • Project checkup - face-to-face meetings (with Brian and me)
      • Project report (up to 2-column 10 pages excluding references)
      • Project poster

      * Note. We strongly suggest writing your course project proposal and report using LaTeX. It is the de facto tool in which most CS research papers are written.



      Potential Project Topics

      • Tez on REEF: Apache Tez is a framework that expresses data flow and processing as a DAG, and executes it on YARN. Jobs for Hive, Pig, etc. use Tez. This project involves implementing a solution to run the Tez framework on REEF. In addition, the project should propose and answer open-ended questions on how Tez on REEF will improve Tez and/or REEF. Possible directions include: REEF apps can make use of Tez DAGs, REEF can improve Tez elasticity, etc.
      • Shuffle service in REEF: Shuffle is an important data movement abstraction, most famously used in MapReduce. Implementation can be tricky because of the large amounts of data, fault tolerance, and concurrency involved. This project involves implementing Shuffle as a service in REEF. The implementation should expose the Shuffle service to REEF apps with a clean API. It should also provide comparable performance and robustness to shuffle in Hadoop MapReduce. An experimental comparison should be made with Hadoop's shuffle implementation.
      • Torque-like scheduler: Many applications require running identical jobs with slightly different configurations. For example, an application that searches for the right parameters of a machine learning algorithm on a fixed sample may use such an approach. The Torque scheduler for HPC environments provides easy configuration for these situations. This project involves implementing a low-latency Torque-like scheduler for REEF, and applying it to interesting applications and datasets.
      • Cross-framework DAG scheduler: REEF supports many data processing frameworks and applications. Many data applications require multiple processing frameworks and applications to be run in stages to produce a final result. In this project, you will develop a cross-framework scheduler that connects the stages as a directed acyclic graph (DAG). The design should provide a clean API and scheduling model, with good performance.
      • MLlib on REEF: MLlib defines a set of common machine learning algorithms, and an implementation in Spark. In this project, you will implement the Machine Learning API as new algorithms in REEF. A performance comparison should be done with the Spark algorithms.
      • Linear algebra lib on REEF: Linear algebra is a fundamental part of solving engineering and computational science problems. In this project, you will implement a library of scalable linear algebra algorithms on REEF.
      • Tajo on REEF: Apache Tajo is a big data warehouse system built on Hadoop. Gruter, a Korean startup company, develops Tajo with other companies such as SKT. Currently Tajo runs with Hadoop v1. In this project, you will implement a solution to run Tajo on REEF to remove Tajo's dependency on Hadoop v1.
      • Deep learning on REEF: Deep Learning is a set of Machine Learning algorithms that have seen considerable interest recently because of their application to signal and information processing (such as speech recognition, vision recognition, etc.). The challenge is how to efficiently process complex distributed neural networks.
      • Wake visualizer for troubleshooting: Wake is the event-driven framework that REEF is built on. As the already large scale and high concurrency continue to increase, subtle performance bugs can crop up that are difficult to grasp. In this project, you will develop a pluggable instrumentation scheme for Wake and a visualizer that shows Wake status in terms of request rate, queue lengths, thread pool sizes, computation times, etc.
      • Wake resource auto-tuning: Wake currently performs static partitioning of thread pools among stages. In this project, you will develop a local execution engine that tunes thread allocation to stages dynamically based on stage loads.
      • Pregel/AsynchronousGraphProcessing on REEF: Pregel is a Bulk Synchronous Parallel (BSP) implementation for graph processing developed by Google. Another approach is to implement asynchronous algorithms for graph processing. By making use of REEF's elastic primitives, gains with the asynchronous approach are anticipated. In this project, you will implement graph algorithms in both approaches and compare their performance and results.
      • Tasklet: REEF currently runs a Task per Evaluator. Certain workloads can see advantages by running multiple small Tasks, or Tasklets, concurrently inside an Evaluator. In this project, you will define an API for Tasklets, implement them, and run performance comparisons on workloads between Task and Tasklet configurations. 
      • Spark block manager on REEF: Spark stores blocks of data as RDDs, which are maintained by the block manager component. We have developed Spark on REEF, which uses REEF as the runtime for Spark. In this project, you will implement a REEF-managed block manager, which can allow the current Spark job, other Spark jobs, or other frameworks to access the RDDs.
      • Slider on REEF: Slider is a framework that makes it easy to deploy and manage long-running static applications in a YARN cluster. The focus is to adapt existing applications such as HBase and Accumulo to run on YARN with little modification. This project explores implementing the Slider functionality on REEF and demonstrates that such a service can be easily implemented with REEF.
      • DSL for REEF: REEF applications are implemented as Java code in separate classes for Task, Driver, Launch, etc. In this project, you will take an alternative approach: you will create a domain specific language (DSL) that can define locally the configuration and execution of a REEF application. A good candidate for a DSL in the JVM is Scala.
      • DSL for ML: In this project, you will create a domain specific language (DSL) that can define the configuration and execution of machine learning algorithms.
      • DSL for Graph Algorithms: In this project, you will create a domain specific language (DSL) that can define the configuration and execution of graph algorithms in both BSP (Pregel) and asynchronous forms.
      • Interoperation of RxJava and Wake: RxJava is an event-based programming framework developed by Netflix. In this project, you will develop bindings that allow RxJava and Wake based code to interoperate cleanly.
      • Scala binding for Tang: Tang is the dependency injection framework for configuring distributed systems developed with REEF. The Tang APIs give configurations that are strongly typed and easily verified for correctness. In this project, you will implement Scala bindings for Tang.
      • Python binding for Tang: In this project, you will implement Python bindings for Tang.
      • Tang API implemented atop Guice: In this project, you will implement the Tang API atop Google Guice.
      * Note. Some topics suggested above may not be challenging enough as a team project. In that case, you need to broaden the scope to present the topic for your proposal.
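
      To make the "Cross-framework DAG scheduler" topic above more concrete, here is a minimal sketch of ordering stages by their dependencies with a topological sort (Kahn's algorithm). It is not REEF code and is not a proposed design: the function name, the dictionary-based input, and the stage names are all hypothetical, and a real scheduler would also handle resource allocation, concurrency, and failures.

```python
from collections import deque


def schedule_stages(deps):
    """Return an execution order for stages.

    `deps` maps each stage name to the list of stages it depends on
    (hypothetical input format). Raises ValueError if the graph has a cycle.
    """
    # Collect every stage mentioned either as a key or as a dependency.
    stages = set(deps)
    for parents in deps.values():
        stages.update(parents)

    indegree = {s: 0 for s in stages}   # number of unfinished dependencies
    children = {s: [] for s in stages}  # stages unlocked when s finishes
    for stage, parents in deps.items():
        for parent in parents:
            indegree[stage] += 1
            children[parent].append(stage)

    # Start with stages that have no dependencies (sorted for determinism).
    ready = deque(sorted(s for s in stages if indegree[s] == 0))
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for child in sorted(children[stage]):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    if len(order) != len(stages):
        raise ValueError("stage dependencies contain a cycle")
    return order


if __name__ == "__main__":
    # Hypothetical pipeline: an ETL stage feeds training, and both feed a report.
    pipeline = {"etl": [], "train": ["etl"], "report": ["train", "etl"]}
    print(schedule_stages(pipeline))
```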


      Honor code and course-related behavior policy

      We will apply the standard school policy to this course.