Debugging Primitives for Interactive Big Data Processing in Spark 

An abundance of data in science, engineering, national security, and health care has led to the emerging field of big data analytics. To process massive quantities of data, developers leverage data-intensive scalable computing (DISC) systems in the cloud, such as Google's MapReduce, Hadoop, and Apache Spark. While DISC systems help to address the scalability challenges of big data analytics, they also introduce new challenges in debugging. The "big data debugging" project at UCLA addresses these challenges by combining insights from the software engineering and database research communities.

We design an interactive, real-time debugger called BigDebug for Apache Spark, the next-generation data-intensive scalable cloud computing platform. First, BigDebug's simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability.
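The watchpoint and crash-culprit ideas can be illustrated on a single machine. The sketch below is plain Python, not BigDebug's API or Spark code: the names `guarded_map` and `resume_with_fix` are hypothetical, and the example only assumes that a transformation is applied record by record. It captures records that match a user-supplied predicate (a watchpoint), collects records that crash the transformation instead of aborting the whole run, and then re-runs only those records after a fix.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def guarded_map(
    fn: Callable[[T], U],
    records: Iterable[T],
    watch: Callable[[T], bool] = lambda _record: False,
) -> Tuple[List[U], List[T], List[Tuple[T, Exception]]]:
    """Apply fn to every record without letting one bad record abort the run.

    Returns (results, watched, culprits):
      results  - outputs for records that fn processed successfully
      watched  - input records matching the watchpoint predicate
      culprits - (record, exception) pairs for records that crashed fn
    """
    results: List[U] = []
    watched: List[T] = []
    culprits: List[Tuple[T, Exception]] = []
    for record in records:
        if watch(record):
            watched.append(record)
        try:
            results.append(fn(record))
        except Exception as exc:  # keep going and remember the culprit
            culprits.append((record, exc))
    return results, watched, culprits


def resume_with_fix(
    fix: Callable[[T], T],
    fn: Callable[[T], U],
    culprits: List[Tuple[T, Exception]],
) -> List[U]:
    """Re-run only the previously failing records after repairing them."""
    return [fn(fix(record)) for record, _exc in culprits]


if __name__ == "__main__":
    lines = ["3", "14", "x7", "25"]
    results, watched, culprits = guarded_map(
        int, lines, watch=lambda s: len(s) > 1
    )
    print(results, watched, [record for record, _ in culprits])
    print(resume_with_fix(lambda s: s.lstrip("x"), int, culprits))
```

In a distributed setting the same idea would have to run inside each task and ship the captured records back to the user, which this sketch does not attempt.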

We design new data provenance and optimized incremental computation capabilities for Apache Spark to support debugging effectively and efficiently. We also realize automated debugging in DISC by applying automated fault isolation from software engineering in tandem with data provenance from database systems to find a minimum set of failure-inducing inputs.
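To make "a minimum set of failure-inducing inputs" concrete, here is a minimal sketch of delta debugging, a classic automated fault isolation technique from software engineering. It is an illustration under assumptions, not the project's implementation: the function name `ddmin` and the `fails` test oracle are hypothetical, the code runs on an in-memory list rather than on Spark, and it assumes the failure is deterministic and that an empty input does not fail. In the setting described above, data provenance would presumably first narrow the candidate records before a search like this runs.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(records: Sequence[T], fails: Callable[[List[T]], bool]) -> List[T]:
    """Shrink records to a subset on which fails() still returns True.

    The result is 1-minimal: removing any single remaining record makes
    the failure disappear.
    """
    current = list(records)
    if not fails(current):
        raise ValueError("the full input must reproduce the failure")
    granularity = 2
    while len(current) >= 2:
        chunk = max(1, len(current) // granularity)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for index, subset in enumerate(subsets):
            if fails(subset):
                # The failure is contained in one chunk: restart on it.
                current, granularity, reduced = subset, 2, True
                break
            complement = [
                r for j, s in enumerate(subsets) if j != index for r in s
            ]
            # With two chunks, the complement is the other chunk, which is
            # tested as a subset anyway.
            if len(subsets) > 2 and fails(complement):
                current = complement
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if granularity >= len(current):
                break  # already tested every single-record removal
            granularity = min(len(current), granularity * 2)
    return current


if __name__ == "__main__":
    # Hypothetical failure: the job breaks only when records 3 and 7 co-occur.
    print(ddmin(list(range(10)), lambda s: 3 in s and 7 in s))
```

Each call to `fails` stands for re-running the job on a subset of the input, so the number of calls is the cost that matters in practice.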

The project is developed at UCLA and led by Professors Miryung Kim, Tyson Condie, and Todd Millstein. BigDebug is managed by a PhD student, Muhammad Ali Gulzar, who works under the supervision of Professor Miryung Kim. Our first paper on interactive debugging primitives appeared in ICSE 2016.

BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark, Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim, ICSE' 16: Proceedings of 38th IEEE/ACM International Conference on Software Engineering, pages 784-795 (pdf)

BigDebug: Interactive Debugger for Big Data Analytics in Apache Spark, Muhammad Ali Gulzar, Matteo Interlandi, Tyson Condie, Miryung Kim, FSE' 16: The 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, Demonstration Track, pages 1033-1037 (pdf)

Debugging Big Data Analytics in Spark with BigDebug, Muhammad Ali Gulzar, Matteo Interlandi, Tyson Condie, Miryung Kim, SIGMOD' 17: Proceedings of The 2017 ACM SIGMOD/PODS Conference, Demonstration Track, pages 1627-1630 (pdf)

Titian: Data Provenance Support in Spark, Matteo Interlandi, Kshitij Shah, Sai Tetali, Muhammad Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, Tyson Condie, VLDB' 16 (PVLDB Volume 9 Issue 3): Proceedings of the 42nd Conference on Very Large Data Bases, pages 216-227 (pdf)

Optimizing Interactive Development of Data-Intensive Applications, Matteo Interlandi, Sai Deep Tetali, Muhammad Ali Gulzar, Joseph Noor, Tyson Condie, Miryung Kim, Todd D. Millstein, SoCC' 16: ACM Symposium on Cloud Computing 2016. pages 510-522 (pdf)

Interactive Debugging for Big Data Analytics, Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi, Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim, HotCloud 2016, The 8th USENIX Workshop on Hot Topics in Cloud Computing, 5 pages (pdf)

A step-by-step demonstration video of BigDebug can be seen here.

If you build your project based on our prototype, please cite this paper. 

Figures: BigDebug's User Interface (left); BigDebug's API Usage (right)