ELEN 6889: Fundamentals of Stream Processing

Course Outline

The world is experiencing an explosion in the amount of data being produced and collected, with the total estimated to have exceeded tens of zettabytes (1 zettabyte = 10^21 bytes). Online data is produced continuously by different types of sensors, processes, and human activities, especially in an Internet of Things world. Being able to analyze this large amount of data as it is generated, and to distill insights for improved decision making, is vital to the functioning of many large and complex applications in various domains. These domains include financial systems, cyber- and physical-security systems, environmental monitoring, health care, manufacturing systems, telecommunication networks, and power distribution grids.

Current store-and-process information management technologies, based on databases and data warehouses, are ill-suited to meet the performance, scalability, and usability requirements of these applications. This has led to the emergence of new paradigms for distributed compute systems and analytic algorithms, including MapReduce for offline data and stream processing for online data.

Stream processing is a novel distributed compute paradigm that supports the gathering, processing, and analysis of high-volume, heterogeneous, continuous data streams, to extract insights and actionable results in real time. In this course, we will cover fundamentals of the stream processing paradigm in two parts. First, we will introduce key components such as the distributed system infrastructure, a novel programming model, and application design and implementation. Next, we will discuss emerging mining algorithms for pre-processing, classification, clustering, anomaly detection, and pattern mining.

Throughout the course, we will describe the underlying theoretical principles and provide illustrative examples from real-world case studies. The course will also include hands-on exposure to large-scale stream processing through homework assignments involving programming exercises on a real stream processing system. Students will design and implement a final project to explore the state of the art in this field.

Prerequisites

  1. Basics of Data Management and Relational Databases - Preferred
  2. Basic Signal Processing and Time Series Analysis - Preferred
  3. Basic Statistics and Data Analysis Techniques - Preferred
  4. Basics of Distributed Systems - Preferred
  5. Basics of Optimization Theory - Preferred
  6. Programming skills in C/C++, Java, or Python - Mandatory

Instructor

Dr. Deepak S. Turaga

VP of Data Science, Oden Technologies

50, 17th Street, New York, NY

Email: deepak.turaga@gmail.com

Class Website: https://sites.google.com/site/fundamentalsofstreamprocessing/

Textbook

Fundamentals of Stream Processing - Application Design, Systems and Analytics. Henrique Andrade, Bugra Gedik, and Deepak Turaga. Cambridge University Press

Logistics

Class: Tuesday 7:00 PM - 9:30 PM, TBD

Office Hours: Tuesday 6:00 PM-6:45 PM

Location: 1322 Mudd

Please feel free to send email to schedule appointments at other times.

Grading

Homework – 30%

Seminar – 30%

Project – 40%

Lecture Schedule

Detailed Logistics and Grading

Homework and Programming Exercises: 30% of the grade

The course will involve two programming exercises using Apache Spark or Apache Beam, which will be taught during the course. These exercises will cover concepts such as data ingestion, data reduction and sampling, windowing, and streaming optimization.
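
To give a flavor of two of the concepts named above, windowing and sampling, here is a minimal sketch in plain Python. It does not use Spark or Beam and is not taken from the course assignments; the function names, the (timestamp, value) event format, and the sample data are all assumptions made for illustration. It counts events per fixed, non-overlapping (tumbling) window and keeps a uniform random sample of a stream using reservoir sampling.

```python
import random
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping (tumbling) window.

    `events` is an iterable of (timestamp_seconds, value) pairs
    (a hypothetical format chosen for this sketch).
    Returns a dict mapping window start time -> number of events.
    """
    counts = defaultdict(int)
    for timestamp, _value in events:
        # Align each timestamp to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # overwriting a uniformly chosen stored item.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Illustrative data only: five events with integer timestamps in seconds.
events = [(0, "a"), (3, "b"), (9, "c"), (10, "d"), (25, "e")]
window_counts = tumbling_window_counts(events, window_seconds=10)
sample = reservoir_sample(range(1000), k=5, rng=random.Random(0))
```

Both functions read their input once and hold only a small amount of state (one counter per window, or k sampled items), which is the property that makes them usable on unbounded streams. A real stream processing system adds what this sketch leaves out, such as distribution, late-arriving data, and fault tolerance.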

Seminar: 30% of the grade

Students will be divided into groups, and each group will present one seminar on state-of-the-art research in stream processing. Papers and topics for the seminars will be posted.

Final project: 40% of the grade

Students will design and implement group projects aimed at creating moderate-sized stream processing applications, and at experimenting with and showcasing state-of-the-art algorithms in a close-to-real setting. For the final project, we particularly encourage the development of first-of-a-kind or open-ended research prototypes.

The following areas can be considered:

1) Applications: The stream programming paradigm enables the development of new applications that are, in some cases, considerably different from existing ones. Please consider implementing a compelling new application or rethinking an existing one. Interesting streaming data sources include Twitter feeds, stock ticker information, healthcare data (Physionet), and live or streaming video.

2) Systems: Streaming applications can have multiple requirements in terms of performance and fault tolerance. Investigate different performance optimization strategies or fault tolerance strategies.

3) Analytics/Algorithms: The central aspect in developing streaming applications is the processing analytics in charge of incrementally ingesting the data and detecting interesting or abnormal patterns in live streams. A substantial amount of work has been done in adapting pattern extraction and data mining algorithms to the streaming paradigm. You can consider implementing a family of algorithms from the literature and defining an application scenario to showcase your implementation. Aspects you may consider investigating include performance studies, inter-algorithm comparisons, and optimization techniques, among others.

4) Other: Anything else of your choosing.
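
As one illustration of the incremental analytics described in area 3, the following is a minimal Python sketch of flagging abnormal values in a live stream. It is only one possible approach, not a method prescribed by the course: it uses Welford's online algorithm to maintain a running mean and variance in constant memory and flags values that lie more than a threshold number of standard deviations from the running mean. The class and function names, the threshold, the warm-up length, and the sample readings are all assumptions made for this sketch.

```python
import math


class RunningStats:
    """Incrementally tracks mean and variance (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; reported as 0.0 for fewer than two points.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def detect_anomalies(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for values far from the running mean.

    `threshold` and `warmup` are illustrative defaults, not tuned values.
    Each value is tested against statistics of the values seen before it,
    then folded into the statistics whether or not it was flagged.
    """
    stats = RunningStats()
    for i, x in enumerate(stream):
        std = stats.std()
        if stats.n >= warmup and std > 0 and abs(x - stats.mean) > threshold * std:
            yield i, x
        stats.update(x)


# Made-up sensor readings with one obvious outlier at index 7.
readings = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 25.0, 10.1, 9.9]
anomalies = list(detect_anomalies(readings))
```

Because flagged values are still added to the running statistics, a large outlier inflates the variance and makes the detector less sensitive afterwards. Excluding flagged values, or using a sliding window so that old data ages out, are design choices a project could compare.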

Project Deliverables

(a) 10% of the project grade: Discussion of your choice of project and approach. This must minimally include a description of the problem you intend to tackle, how you will obtain data (real or synthetic), a sketch of your processing graph and overall design, and the kind of evaluation you intend to perform (e.g., performance in terms of throughput or latency, reliability, visualization, etc.). I will provide feedback, discuss whether the proposal is at the right level of difficulty, and, when necessary, propose changes and improvements.

(b) 40% of the project grade: a "professional" and polished presentation. This must include a set of PowerPoint (or equivalent) slides and a live demo. The quality and timeliness of your presentation will account for a substantial part of your grade.

(c) 50% of the project grade: your source code, including visualization interfaces, makefiles, etc., and a final report (up to 5 pages) summarizing your project, its goals, results, and a description of how it could be further improved if you weren't so keen on going on vacation.