ELEN 6889: Fundamentals of Stream Processing

Course Outline

The world is experiencing an explosion in the amount of data being produced and collected, estimated to have crossed into the hundreds of zettabytes (1 zettabyte = 10^21 bytes). Online data is produced continuously by different types of sensors, processes, and human activities, especially in an Internet of Things world. Being able to analyze this large amount of data as it is generated continuously, and distill insights for improved decision making, is vital to the functioning of several large and complex applications in various domains. These domains include financial systems, cyber- and physical-security systems, environmental monitoring, health care, manufacturing systems, telecommunication networks, and power distribution grids.

Current store-and-process information management technologies, based on databases and data warehouses, are ill-suited to meet the performance, scalability, and usability requirements of these applications. This has led to the emergence of new paradigms for distributed compute systems and analytic algorithms, including MapReduce for offline data and stream processing for online data.

Stream processing is a distributed compute paradigm that supports the gathering, processing, and analysis of high-volume, heterogeneous, continuous data streams, to extract insights and actionable results in real time. In this course, we will cover fundamentals of the stream processing paradigm in two parts. First, we will introduce key components such as the distributed system infrastructure, a novel programming model, and application design and implementation. Next, we will discuss emerging mining algorithms for pre-processing, classification, clustering, anomaly detection, and pattern mining.

Throughout the course, we will describe the underlying theoretical principles and provide illustrative examples from real-world case studies. The course will also include hands-on exposure to large-scale stream processing through homework assignments involving programming exercises on a real stream processing system. Students will design and implement a final project to explore the state of the art in this field.

Prerequisites

Instructor

Dr. Deepak S. Turaga

Chief Technology Officer, Oden Technologies

50, 17th Street, New York, NY

Email: deepak.turaga@gmail.com

Class Website: https://sites.google.com/site/fundamentalsofstreamprocessing/

Textbook

Fundamentals of Stream Processing: Application Design, Systems, and Analytics. Henrique Andrade, Bugra Gedik, and Deepak Turaga. Cambridge University Press.

Logistics

Class: Thursday 7:00 PM - 9:30 PM

Office Hours: Thursday 6:00 PM - 6:45 PM

Location: TBD

Please feel free to send email to schedule appointments at other times.

Grading

Homework – 30%

Seminar – 30%

Project – 40%

Lecture Schedule


Detailed Logistics and Grading

Homeworks and Programming Exercises: 30% of the grade  

The course will involve two programming exercises using Apache Spark or Apache Beam, both of which will be taught during the course. These exercises will cover concepts such as data ingestion, data reduction and sampling, windowing, and streaming optimization.
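To give a flavor of two of these concepts, here is a minimal sketch in plain Python (standard library only, not Spark or Beam). It assumes events arrive as (timestamp, value) pairs; the function names `tumbling_window_counts` and `reservoir_sample` are illustrative and are not part of the course material or of any framework API.

```python
import random
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping (tumbling) time window.

    `events` is an iterable of (timestamp_seconds, value) pairs.
    Returns a dict mapping each window's start time to its event count.
    """
    counts = defaultdict(int)
    for timestamp, _value in events:
        # Align the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of up to k items from a stream
    whose length is not known in advance (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    events = [(0, "a"), (3, "b"), (10, "c"), (11, "d"), (25, "e")]
    print(tumbling_window_counts(events, window_seconds=10))
    print(reservoir_sample(range(1000), k=5, seed=42))
```

Both functions make a single pass over the data and keep only a small, bounded amount of state per window or per sample, which is the property that makes them usable on unbounded streams. Spark and Beam provide their own windowing and sampling primitives, which the exercises will use instead.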

Seminar: 30% of the grade

Students will be divided into groups, and each group will present one seminar on state-of-the-art research in stream processing. Papers and topics for the seminars will be posted.

Final project: 40% of the grade

Students will design and implement group projects aimed at creating moderate-sized stream processing applications, and at experimenting with and showcasing state-of-the-art algorithms in a close-to-real setting. For the final project, we particularly encourage the development of first-of-a-kind or open-ended research prototypes.

The following areas can be considered:

1) Applications: The stream programming paradigm enables the development of new applications that are, in some cases, considerably different from existing ones. Please consider implementing a compelling new application or rethinking an existing one. Interesting streaming data sources include Twitter feeds, stock ticker information, healthcare data (Physionet), and live or streaming video.

2) Systems: Streaming applications can have multiple requirements in terms of performance and fault tolerance. Investigate different performance optimization or fault tolerance strategies.

3) Analytics/Algorithms: The central aspect of developing streaming applications is the analytics responsible for incrementally ingesting data and detecting interesting or abnormal patterns in live streams. A substantial amount of work has been done on adapting pattern extraction and data mining algorithms to the streaming paradigm. You can consider implementing a family of algorithms from the literature and defining an application scenario to showcase your implementation. Aspects you may consider investigating include performance studies, inter-algorithm comparisons, and optimization techniques, among others.

4) Others: Anything else of your choosing
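As a concrete illustration of the incremental analytics mentioned under area 3, the sketch below flags abnormal values in a numeric stream using a running mean and standard deviation (Welford's online algorithm) and a z-score threshold. It is one simple possibility under stated assumptions, not a recommended project design: the names `RunningStats` and `detect_anomalies`, the threshold of 3.0, and the warm-up length of 5 are all illustrative choices.

```python
import math


class RunningStats:
    """Running mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the statistics in O(1) time."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))


def detect_anomalies(stream, threshold=3.0, warmup=5):
    """Return values that lie more than `threshold` standard deviations
    from the mean of all values seen before them.

    The first `warmup` values are never flagged. Flagged values are still
    folded into the statistics afterwards (a simplifying assumption).
    """
    stats = RunningStats()
    anomalies = []
    for x in stream:
        if stats.count >= warmup and stats.stddev > 0:
            if abs(x - stats.mean) / stats.stddev > threshold:
                anomalies.append(x)
        stats.update(x)
    return anomalies


if __name__ == "__main__":
    readings = [10, 11, 9, 10, 11, 10, 50, 10, 9]
    print(detect_anomalies(readings))
```

The detector processes each value once and stores only three numbers, so its memory use does not grow with the length of the stream. A project in this area might compare such a baseline against more sophisticated streaming algorithms from the literature.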

Project Deliverables

(b) 60% of the project grade: a "professional" and polished presentation. This must include a set of PowerPoint (or equivalent) slides and a live demo. The quality and timeliness of your presentation will account for a substantial part of your grade.

(d) 40% of the project grade: your source code, including visualization interfaces, makefiles, etc., and a final report (up to 5 pages) summarizing your project, its goals, its results, and how it could be further improved if you weren't so keen on going on vacation.