ELEN 6889: Fundamentals of Stream Processing

Course Outline
The world is experiencing an explosion in the amount of data being produced and collected, estimated to have crossed tens of zettabytes (1 zettabyte = 10^21 bytes). Online data is produced continuously by different types of sensors, processes, and human activities, especially in an Internet of Things world. Being able to analyze this large amount of data as it is generated, and to distill insights for improved decision making, is vital to the functioning of several large and complex applications in various domains. These domains include financial systems, cyber- and physical-security systems, environmental monitoring, health care, manufacturing systems, telecommunication networks, and power distribution grids.

Current store-and-process information management technologies, based on databases and data warehouses, are ill-suited to meet the performance, scalability, and usability requirements of these applications. This has led to the emergence of new paradigms for distributed compute systems and analytic algorithms, including MapReduce for offline data and stream processing for online data.

Stream processing is a novel distributed compute paradigm that supports the gathering, processing, and analysis of high-volume, heterogeneous, continuous data streams, to extract insights and actionable results in real time. In this course, we will cover fundamentals of the stream processing paradigm in two parts. First, we will introduce key components such as the distributed system infrastructure, a novel programming model, and application design and implementation. Next, we will discuss emerging mining algorithms for pre-processing, classification, clustering, anomaly detection, and pattern mining.

Throughout the course, we will describe the underlying theoretical principles and provide illustrative examples from real-world case studies. The course will also include hands-on exposure to large-scale stream processing through homework assignments involving programming exercises on a real stream processing system. Students will design and implement a final project to explore the state of the art in this field.

Prerequisites
Basics of Data Management and Relational Databases - Preferred
Basic Signal Processing and Time Series Analysis - Preferred
Basic Statistics and Data Analysis Techniques - Preferred
Basics of Distributed Systems - Preferred
Basics of Optimization Theory - Preferred
Programming skills in C/C++ or Java (mandatory) and scripting languages such as Perl and Python (recommended)

Instructor
Dr. Deepak S. Turaga
IBM T. J. Watson Research Center
1101 Kitchawan Road, Yorktown Heights, NY 10594
Email: turaga@us.ibm.com

Class Discussion List: streams2018@googlegroups.com
Class Website: https://sites.google.com/site/fundamentalsofstreamprocessing/

Textbook
Fundamentals of Stream Processing: Application Design, Systems, and Analytics. Henrique Andrade, Bugra Gedik, and Deepak Turaga. Cambridge University Press.

Class: Tuesday 7:00 PM - 9:30 PM, 214 Pupin Laboratories
Office Hours: Tuesday 6:00 PM-6:45 PM
Location: 1322 Mudd
Please feel free to send email to schedule appointments at other times.

Grading
Homework – 20%
Seminar – 20%
Project – 40%
Final – 20%

Lecture Schedule

Jan 16

Motivation, Applications, Examples of Stream Processing Systems

Jan 23

Stream Processing Systems I

IBM Streams, Overview, Usage

Jan 30

Stream Processing Systems II

Apache Spark, Overview, Usage

Feb 06

Stream Relational Processing – HW 1

Fundamental concepts and operations

Feb 13

Application Design Principles

Patterns and Optimizations

Feb 20

Streaming Algorithms I – HW 2

Pre-processing: Descriptive Statistics, Sampling, Sketches

Seminar Topic Selection

Feb 27

Streaming Algorithms II

Pre-processing: Transforms, Quantization, Dimensionality Reduction

Mar 06

Seminar Presentations

Mar 13

Spring Break

Mar 20

Streaming Algorithms III

Mining: Classification, Regression

Project Topic Selection

Mar 27

Streaming Algorithms IV

Mining: Clustering, Frequent Pattern and Association Rules Mining

Apr 03

Intermediate Project Reviews

Apr 10

Advanced Topics

Stream Mining Topologies, Distributed and online learning

Apr 17

Project Presentations

Apr 24

Project Presentations

Detailed Logistics and Grading

Homework and Programming Exercises: 20% of the grade (problem descriptions will be uploaded to our class Google Groups list)
The course will involve two programming exercises on IBM Streams using its SPL programming language (or Apache Spark), which will be taught during the course. These exercises will cover concepts such as data ingest, data reduction and sampling, windowing and time series analysis, and stream mining.

Seminar: 20% of the grade
Students will be divided into groups, and each group will present one seminar on state-of-the-art research in stream processing. Papers and topics for the seminars will be posted.

Final project: 40% of the grade
Students will design and implement group projects that build moderate-sized stream processing applications and that experiment with and showcase state-of-the-art algorithms in a close-to-real setting. For the final project, we particularly encourage the development of first-of-a-kind or open-ended research prototypes.

The following areas can be considered:

1) Applications: The stream programming paradigm enables the development of new applications that are, in some cases, considerably different from existing ones. Please consider implementing a compelling new application or rethinking an existing one. Interesting streaming data sources include Twitter feeds, stock ticker information, healthcare data (PhysioNet), and live or streaming video.

2) Systems: Streaming applications can have multiple requirements in terms of performance and fault tolerance. Investigate different performance optimization or fault tolerance strategies.

3) Analytics/Algorithms: The central aspect in developing streaming applications is the processing analytics in charge of incrementally ingesting the data and detecting interesting or abnormal patterns in live streams. A substantial amount of work has been done on adapting pattern extraction and data mining algorithms to the streaming paradigm. You can consider implementing a family of algorithms from the literature and defining an application scenario to showcase your implementation. Aspects you may consider investigating include performance studies, inter-algorithm comparisons, and optimization techniques.

4) Others: Anything else of your choosing

Project Deliverables
(a) 10% of the project grade: Discussion of the choice of project and approach. This must minimally include a description of the problem you intend to tackle, how you will obtain data (real or synthetic), a sketch of your processing graph and overall design, and the kind of evaluation you intend to perform (e.g., performance in terms of throughput or latency, reliability, visualization, etc.). I will provide feedback, discuss whether the proposal is at the right level of difficulty, and, when necessary, propose changes and improvements.
(b) 40% of the project grade: A "professional" and polished presentation. This must include a PowerPoint (or equivalent) set of slides and a live demo. The quality and timeliness of your presentation will account for a substantial part of your grade.
(c) 50% of the project grade: Your source code, including visualization interfaces, makefiles, etc., and a final report (up to 5 pages) summarizing your project, its goals, results, and a description of how it could be further improved if you weren't so keen on going on vacation.

Final Exam: 20% of the grade
The exam will cover concepts from stream mining algorithms.