Benchmarking Deep Learning Systems
A tutorial at ISCA 2019 on Saturday, June 22, Phoenix, Arizona, USA
About the Tutorial
The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks. It lacks standard tools and methodologies to evaluate and profile models or systems. Without such tools, evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error-prone, which stifles the adoption of those innovations in a rapidly moving field.
The goal of the tutorial is to bring together experts from industry and academia to foster systematic development, reproducible evaluation, and performance analysis of deep learning artifacts. It seeks to address the following questions:
- What are the benchmarks that can effectively capture the scope of the ML/DL domain?
- Are the existing frameworks sufficient for this purpose?
- What are some of the industry-standard evaluation platforms or harnesses?
- What are the metrics for carrying out an effective comparative evaluation?
- 08:30 - 10:00 AM Introduction to MLPerf
- 10:00 - 10:30 AM Coffee break
- 10:30 - 12:00 PM Challenges and Pitfalls in Benchmarking ML
- 12:00 - 01:30 PM Lunch break
- 01:30 - 02:30 PM MLModelScope Deep Dive
- 02:30 - 03:00 PM MLModelScope for MLPerf
- 03:00 - 03:30 PM Coffee break
- 03:30 - 05:00 PM Tools and Methodologies
- 05:00 - 05:30 PM Open Issues / Challenges
- 05:30 - 06:00 PM Closing
The tutorial will cover a range of different topics, including but not limited to the following:
Representative benchmarks for the ML domain
Benchmarks are instrumental to the development of the architecture and systems communities. We will cover the various ongoing efforts in ML benchmarking. More specifically, we will present MLPerf (mlperf.org), an ongoing industry-wide effort involving over 30 companies to create a SPEC-like benchmark that standardizes how we measure the “training” and “inference” performance of ML models, software frameworks, ML hardware accelerators, and ML cloud and edge platforms. We will discuss the influence of academic benchmarking efforts such as the Fathom benchmark suite from Harvard, the DAWNBench suite from Stanford University, and the TBD suite from the University of Toronto on the initial design of MLPerf. We will discuss the benchmarks in MLPerf and also elaborate on the nuances of developing an ML benchmark for the industry, such as how models are chosen and prepared for benchmarking.
Challenges presented by existing frameworks
We will explain the common pitfalls in benchmarking ML systems. For instance, a common benchmarking trap is to assume that any two ML frameworks are mathematically equivalent in their implementation of ML models. But no two frameworks (e.g., PyTorch, TensorFlow, MXNet, Caffe2) are truly alike. There are pros and cons to each framework's implementation, and understanding these subtleties is critical to correctly benchmarking systems and understanding the performance of various ML/DL models. Apart from ML frameworks, other factors play an important role in benchmarking, such as the pre- and post-processing steps that are essential for running the benchmarks inside a framework, the supporting software libraries (and their versions) needed to compile the framework, and the architecture configuration of the underlying computing hardware. We will show that there are non-obvious intricacies that, if not well understood, can lead to "mysterious" inconsistent comparisons. Hence, as system researchers, we aim to showcase the importance of avoiding common pitfalls when benchmarking ML models across different frameworks, computing hardware, supporting software stacks, and end-to-end benchmarking pipelines on different datasets.
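As a small illustration of the pre-processing pitfall, the sketch below applies two plausible input conventions to the same pixel. It is not taken from any framework or from the tutorial material: the function names are hypothetical, and the mean and standard-deviation constants are commonly used ImageNet values chosen only for illustration.

```python
# Hypothetical example: two plausible pre-processing conventions for one pixel.
# Function names are illustrative; the constants are commonly used ImageNet
# values and are an assumption, not something specified by the tutorial.

def preprocess_scaled_rgb(pixel_rgb,
                          mean=(0.485, 0.456, 0.406),
                          std=(0.229, 0.224, 0.225)):
    """Scale 8-bit RGB values to [0, 1], then normalize each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(pixel_rgb, mean, std))


def preprocess_mean_subtracted_bgr(pixel_rgb,
                                   mean_bgr=(103.939, 116.779, 123.68)):
    """Reorder channels to BGR and subtract a per-channel mean on the 0-255 scale."""
    r, g, b = pixel_rgb
    return tuple(c - m for c, m in zip((b, g, r), mean_bgr))


def max_abs_difference(a, b):
    """Largest element-wise gap between two equally sized tuples."""
    return max(abs(x - y) for x, y in zip(a, b))


pixel = (200, 120, 40)  # one 8-bit RGB pixel
a = preprocess_scaled_rgb(pixel)
b = preprocess_mean_subtracted_bgr(pixel)

print("scaled RGB input:        ", a)
print("mean-subtracted BGR input:", b)
print("largest gap:             ", max_abs_difference(a, b))
```

Both functions are reasonable on their own, yet they hand a model very different numbers for the same pixel. If a model trained under one convention is benchmarked under the other, its accuracy changes for reasons unrelated to the framework or hardware being compared.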
Tools and methodologies for evaluating platforms
There is a need for evaluation platforms that can evaluate different hardware platforms, software frameworks, and models across cloud and edge systems. We will introduce a set of open-source evaluation platforms that are hardware- and software-agnostic, extensible, and customizable, for evaluating and profiling ML models across datasets, frameworks, and hardware, and within different AI application pipelines. We will demonstrate the set of tools and methodologies built under the TBD Benchmark Suite from the University of Toronto and Microsoft Research, with a key focus on analyzing performance, hardware utilization, memory consumption, and the performance aspects (networking and I/O) of distributed training. We will also cover an open-source evaluation platform called MLModelScope, from the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), which lowers the cost and effort of model evaluation and profiling, making it easier to reproduce, evaluate, and analyze accuracy, performance, and resilience claims of models, frameworks, and systems. All major frameworks, hundreds of models, datasets, and many major hardware types are available under it.
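To make concrete what even a very small evaluation harness has to get right, here is a minimal latency-measurement sketch. It does not use or represent the API of MLModelScope or the TBD tools; `benchmark` and `fake_inference` are hypothetical names, and the percentile is a simple index-based approximation.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=50):
    """Time fn() after warm-up runs and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up: keep one-time costs (lazy init, caches) out of the timings

    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    # Simple index-based approximation of the 90th percentile.
    p90_index = min(len(latencies_ms) - 1, int(0.9 * len(latencies_ms)))
    return {
        "runs": runs,
        "mean_ms": statistics.mean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "p90_ms": latencies_ms[p90_index],
        "max_ms": latencies_ms[-1],
    }


def fake_inference():
    """Stand-in workload (an assumption); replace with a real model call."""
    return sum(i * i for i in range(10_000))


stats = benchmark(fake_inference)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Reporting the median and a tail percentile alongside the mean is a deliberate choice: a single average can hide the run-to-run variation that often explains inconsistent comparisons. A real harness would also record the framework, library versions, and hardware configuration for each run.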
The industry sees a need to educate students and professionals in the art of ML benchmarking and analysis. Deep learning is a complex space that requires optimization of algorithms, software, and hardware stacks. Our goal is that attendees leave the tutorial with a good sense of the ML landscape, an understanding of the state of affairs in ML model benchmarking for conducting research, an appreciation of the research value of ML benchmarking, and knowledge of the existing tools for debugging and analyzing the performance of their ML/DL models. Ideally, the tutorial raises the bar for informed research and sparks new ideas.
- Vijay Janapa Reddi (Harvard University)
- Jinjun Xiong (IBM)
- Wen-mei Hwu (UIUC)
- Gennady Pekhimenko (University of Toronto)
- Abdul Dakkak (UIUC)
- Cheng Li (UIUC)
- This is the first time the tutorial is being held!
- Vijay Janapa Reddi (Harvard University), Wen-mei Hwu (UIUC), Jinjun Xiong (IBM), Karthik V Swaminathan (IBM), Cody Coleman (Stanford), Carole-Jean Wu (ASU and Facebook)