Course ELEN 6889

Course Outline
A growing number of applications require processing and classification of continuous, high-volume data streams. These include photo and video streaming services, online analysis of financial data streams, real-time manufacturing process control, search engines, spam filters, security, and medical services. These applications are often developed as processing topologies of distributed operators deployed on large-scale stream mining systems. Distributed stream mining systems provide the scalability, reliability, ease of development, and failure resilience that such large-scale, real-time stream mining applications require, and help them meet their other performance objectives.
In this course we will cover the fundamentals of stream processing systems, algorithms, and applications. We will describe the underlying systems architecture, programming models, and algorithms for stream processing, mining, and analysis. We will include hands-on exposure to large-scale stream processing through homework assignments involving programming exercises on IBM InfoSphere Streams, a distributed stream processing system. We will also design student seminars and projects to explore the state of the art in the field and open research challenges.

Prerequisites
Basics of Data Management and Relational Databases - Preferred
Basic Signal Processing and Time Series Analysis - Preferred
Basic Statistics and Data Analysis Techniques - Preferred
Basics of Distributed Systems - Preferred
Basics of Optimization Theory - Preferred
Programming skills in C/C++ (mandatory) and scripting languages such as Perl and Python (highly recommended)

Instructors
Dr. Henrique Andrade and Dr. Deepak S. Turaga
IBM T. J. Watson Research Center
19 Skyline Drive, Hawthorne, NY
Email

Logistics
Class: Wednesday 4:10 PM - 6:00 PM, SCE 415
Office Hours: Wednesday 3:00 PM-4:00 PM, Mudd 1312.
Please feel free to send email to schedule appointments at other times.

Grading
Programming Exercises: 40% of the grade
Final project: 60% of the grade


Lecture Schedule

Date | Lecture | Title | Topics
Jan 20 | 1 | Introduction to Large-Scale Stream Processing | Motivation & Applications in different sectors; Introduction to Stream Processing; Overview of Stream Processing Systems: Aurora, Borealis, Stream, System S
Jan 27, Feb 3 | 2 | Stream Processing Systems | Distributed Systems, Transport, Processing Elements; Management and Tooling of a Stream Processing System
Feb 17 | 3 | Developing Stream Processing Applications in SPADE. Part I | Semantics and Languages; Distributed Programming
Feb 24 | 4 | Developing Stream Processing Applications in SPADE. Part II | Semantics and Languages; Distributed Programming
Mar 3 | 5 | Developing Stream Processing Applications: Design Patterns and Guidelines | Guidelines; Patterns and Implementation Examples
Mar 10 | 6 | Homework Discussion and Student Seminar Proposals | (Mid-term week)
Mar 17 | | Spring Break |
Mar 24 | 7 | Data Reduction for Stream Processing | Summarization, Sampling, Quantization, Load-Shedding
Mar 31 | | Seminar and Project Discussions |
Apr 7-Apr 14 | 8 | Large-Scale Stream Mining | Clustering, Classification, Pattern Mining, Change detection
Apr 21-Apr 28 | 9 | Resource Adaptive Mining | Operator Scheduling, Query and Stream Optimization, Adaptive Processing, Fault-tolerance, Classifier configuration
May 5 | 14 | Project Presentations |
May 12 | 15 | Project Presentations | (Finals week)




Detailed Logistics and Grading

Homework and Programming Exercises: 40% of the grade (see the problem descriptions in the files associated with our class Google Groups page)
 


Final project: 60% of the grade

Students will work on group projects to develop moderate-sized stream processing applications and tooling, and to demonstrate state-of-the-art algorithms in a real implementation. Design and development of open-ended research prototypes is strongly encouraged.

The following areas can be considered:

1) Fault-tolerance: many streaming applications work, or will have to work, as part of critical business or scientific infrastructures. A fundamental design aspect in this case is to plan for outages and failures. You can consider application designs and strategies that increase the reliability of a streaming application. Aspects you may consider investigating include topological redundancy, fault modeling, and impact on performance, among others. A project in this area must include a technique as well as a sample application.

2) Analytics: the central aspect in developing streaming applications is the processing analytics in charge of incrementally ingesting the data and detecting interesting or abnormal patterns in live streams. A substantial amount of work has been done in adapting pattern extraction and data mining algorithms to the streaming paradigm. You can consider implementing a family of algorithms from the literature and defining an application scenario to showcase your implementation. Aspects you may consider investigating include performance studies, inter-algorithm comparisons, and optimization techniques, among others.

3) Applications: the stream programming paradigm enables the development of new applications that are, in some cases, considerably different from existing ones, which tend to work in batch mode or on data stored in warehouses. You may consider implementing a compelling new application or rethinking existing Internet applications such as instant messaging, Twitter, or others in the context of a stream processing middleware platform.

4) Software Engineering Support: the stream processing programming paradigm is very much in its infancy. You may consider developing tools that help developers create new applications more effectively, for example tooling for debugging, visualization, or client interfaces on mobile devices, among others.

5) Data-At-Rest and Data-In-Motion Integration: a challenging emerging problem is the integration of live data streams with data that has been accumulated and aggregated in static data sources, such as databases, data warehouses, flat files, and others. In many cases, these static data sources (data-at-rest) can be used to enrich the processing of live streams (data-in-motion). Doing so requires coming up with operators and optimization strategies that enable the application to perform at wire speed. You may consider query and caching strategies that might help. You will want to propose an application scenario to demonstrate your design.
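
As a rough illustration of the caching strategies mentioned in area 5, the sketch below enriches tuples from a live stream with an attribute looked up in a static source, using a small LRU cache so that repeated keys skip the slow lookup. It is written in plain Python rather than SPADE, and every name in it (`CUSTOMER_TABLE`, `slow_lookup`, `LRUCache`, `enrich`) is hypothetical: a real project would replace the dictionary with a database or warehouse query and the generator with a stream operator.

```python
from collections import OrderedDict

# Hypothetical data-at-rest source; a stand-in for a database or warehouse table.
CUSTOMER_TABLE = {1: "gold", 2: "silver", 3: "bronze"}


def slow_lookup(customer_id):
    """Stand-in for an expensive query against the static source."""
    return CUSTOMER_TABLE.get(customer_id, "unknown")


class LRUCache:
    """Bounded cache that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key, loader):
        if key in self.entries:
            # Mark the key as most recently used.
            self.entries.move_to_end(key)
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        value = loader(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            # Drop the least recently used entry.
            self.entries.popitem(last=False)
        return value


def enrich(stream, cache):
    """Attach the customer tier (data-at-rest) to each live tuple (data-in-motion)."""
    for customer_id, amount in stream:
        tier = cache.get(customer_id, slow_lookup)
        yield (customer_id, amount, tier)


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    live_stream = [(1, 10.0), (2, 5.0), (1, 7.5), (3, 1.0), (1, 2.0)]
    for enriched in enrich(live_stream, cache):
        print(enriched)
    print("hits:", cache.hits, "misses:", cache.misses)
```

The hit and miss counters are the kind of measurement an evaluation could report when comparing cache sizes or eviction policies against throughput and latency.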

The deliverables:

(a) 10% of the project grade (you have already provided this): a concise report (2 pages) describing your intentions. This must minimally include a description of the problem you intend to tackle, how you will obtain data (real or synthetic), a sketch of your processing graph and overall design, and the kind of evaluation you intend to perform (e.g., performance in terms of throughput or latency, reliability, visualization, etc.). The instructors will provide written feedback, describing whether the proposal is at the right level of difficulty and, when necessary, proposing changes and improvements.
(b) 20% of the project grade (due on Apr 28): a concise report (2 pages) describing your progress. This must minimally include a sketch of your application (with code snippets of the most relevant parts) and preliminary results.
(c) 30% of the project grade (due on your presentation day): a "professional" and polished presentation. This must include a PowerPoint (or equivalent) set of slides and a live demo. The quality and timeliness of your presentation will account for a substantial part of your grade.
(d) 40% of the project grade (due on May 12): your source code, including visualization interfaces, makefiles, etc., and a final report (5 pages) summarizing your project, its goals, its results, and a description of how it could be further improved if you weren't so keen on going on vacation.



Copyright 2009/2010 Deepak Turaga and Henrique Andrade. All rights reserved.