ML Performance

Benchmarking Deep Learning Systems

To be held in conjunction with the International Symposium on Computer Architecture (ISCA'21)

June 18, 2021

About the Tutorial

The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks, and it lacks standard tools and methodologies to evaluate and profile models or systems. In the absence of such tools, the state of the practice for evaluating and comparing the benefits of proposed AI innovations (be they hardware or software) on end-to-end AI pipelines is both arduous and error-prone, stifling the adoption of these innovations in a rapidly moving field.

The goal of the tutorial is to bring together experts from industry and academia to foster systematic development, reproducible evaluation, and performance analysis of deep learning artifacts. It seeks to address the following questions:

    1. What are the benchmarks that can effectively capture the scope of the ML/DL domain?

    2. Are the existing frameworks sufficient for this purpose?

    3. What are some of the industry-standard evaluation platforms or harnesses?

    4. What are the metrics for carrying out an effective comparative evaluation?


Schedule

(All times are EDT)

  • 12:00 PM - 12:05 PM: Opening

  • 12:05 PM - 12:25 PM: The DAWN of MLPerf - Cody Coleman (Stanford) <slides>

  • 12:25 PM - 12:55 PM: MLPerf Training and Inference - Peter Mattson (Google), Ramesh Chukka (Intel) <slides>

  • 12:55 PM - 1:15 PM: Skyline and RL-Scope - James Gleeson (University of Toronto)

  • 1:15 PM - 1:30 PM: Coffee Break

  • 1:30 PM - 2:55 PM: Cross-Stack Analysis of DL Pipelines and DL Optimizations - Jinjun Xiong (IBM), Wen-mei Hwu (UIUC) <slides>

  • 2:55 PM - 3:00 PM: Conclusion

Program Overview

The tutorial will cover a range of different topics, including but not limited to the following:

Representative benchmarks for the ML domain

Benchmarks are instrumental to the development of the architecture and systems communities. We will cover the various ongoing efforts in ML benchmarking. More specifically, we will present MLPerf (mlperf.org), an ongoing industry-wide effort, involving more than 30 companies, to create a SPEC-like benchmark that standardizes how we measure the “training” and “inference” performance of ML models, software frameworks, ML hardware accelerators, and ML cloud and edge platforms. We will discuss the influence of academic benchmarking efforts such as the Fathom benchmark suite from Harvard, the DAWNBench suite from Stanford University, and the TBD suite from the University of Toronto on the initial design of MLPerf. We will also walk through the benchmarks in MLPerf and elaborate on the subtle nuances of developing an ML benchmark for the industry, such as how models are chosen and prepared for benchmarking.

Challenges presented by existing frameworks

We will explain the common pitfalls in benchmarking ML systems. For instance, a common benchmarking trap is to assume that any two ML frameworks are mathematically equivalent in their implementation of ML models. In practice, no two frameworks (e.g., PyTorch, TensorFlow, MXNet, Caffe2) are truly alike. Each framework's implementation has pros and cons, and understanding these subtleties is critical to correctly benchmarking systems and understanding the performance of various ML/DL models. Apart from the ML framework itself, other factors play an important role in benchmarking: the pre- and post-processing steps that are essential for running a benchmark inside a framework, the supporting software libraries (and their versions) needed to compile the framework, and the architecture configuration of the underlying computing hardware. We will show that there are non-obvious intricacies and subtleties that, if not well understood, can lead to seemingly "mysterious" and inconsistent comparisons. Hence, as systems researchers, we will teach and showcase how to avoid common benchmarking pitfalls when evaluating ML models across different frameworks, computing hardware, supporting software stacks, and end-to-end benchmarking pipelines on different datasets (see the sketch below).
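
As a concrete illustration of one such pitfall, the sketch below (not taken from the tutorial material, and assuming a recent PyTorch/torchvision installation with an optional CUDA GPU) shows why warm-up iterations and device synchronization matter when timing inference; omitting either can make otherwise identical setups appear to perform very differently.

    # Minimal latency-measurement sketch in PyTorch, illustrating two common
    # pitfalls: skipping warm-up iterations and forgetting to synchronize the
    # GPU before reading the clock.
    import time
    import torch
    import torchvision.models as models

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = models.resnet50(weights=None).eval().to(device)
    batch = torch.randn(1, 3, 224, 224, device=device)

    with torch.no_grad():
        # Warm-up: the first runs include one-time costs (memory allocation,
        # kernel selection, caching) that should not be part of the measurement.
        for _ in range(10):
            model(batch)
        if device == "cuda":
            torch.cuda.synchronize()  # wait for queued GPU work before timing

        start = time.perf_counter()
        for _ in range(100):
            model(batch)
        if device == "cuda":
            torch.cuda.synchronize()  # ensure all timed work has finished
        elapsed = time.perf_counter() - start

    print(f"mean latency: {elapsed / 100 * 1e3:.2f} ms per batch on {device}")

The same care applies to every framework being compared; if one framework is timed with synchronization and another without, the comparison is meaningless.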

Tools and methodologies for evaluating platforms

There is a need for evaluation platforms that enable the evaluation of different hardware platforms, software frameworks, and models across cloud and edge systems. We will introduce a set of open-source evaluation platforms that are hardware/software agnostic, extensible, and customizable for evaluating and profiling ML models across datasets, frameworks, and hardware, and within different AI application pipelines. We will demonstrate the tools and methodologies built under the TBD Benchmark Suite from the University of Toronto and Microsoft Research, with a key focus on analyzing performance, hardware utilization, memory consumption, and the performance aspects (networking and I/O) specific to distributed training. We will also cover an open-source evaluation platform, called MLModelScope, from the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), which lowers the cost and effort of performing model evaluation and profiling, making it easier to reproduce, evaluate, and analyze the accuracy, performance, and resilience claims of models, frameworks, and systems. It supports all major frameworks, hundreds of models and datasets, and many major hardware types.
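
To give a rough feel for how such platforms stay framework- and hardware-agnostic, here is a purely hypothetical Python sketch; the names (Backend, evaluate, EvaluationResult) are illustrative inventions and are not the actual APIs of MLModelScope or the TBD tooling. The idea is that each framework implements one small contract, so the same end-to-end pipeline can be reused across models, frameworks, and hardware.

    # Illustrative sketch of a framework-agnostic evaluation interface.
    # Each framework backend (PyTorch, TensorFlow, ...) implements the same
    # small contract, and the harness drives the full pipeline unchanged.
    import time
    from abc import ABC, abstractmethod
    from dataclasses import dataclass, field
    from typing import Any, Dict, List


    @dataclass
    class EvaluationResult:
        outputs: List[Any]
        metrics: Dict[str, float] = field(default_factory=dict)  # e.g. latency_ms


    class Backend(ABC):
        """Common contract implemented once per framework."""

        @abstractmethod
        def load(self, model_name: str, device: str) -> None: ...

        @abstractmethod
        def preprocess(self, raw_inputs: List[Any]) -> Any: ...

        @abstractmethod
        def predict(self, batch: Any) -> List[Any]: ...


    def evaluate(backend: Backend, model_name: str, device: str,
                 raw_inputs: List[Any]) -> EvaluationResult:
        """Run the same end-to-end pipeline regardless of framework or hardware."""
        backend.load(model_name, device)
        batch = backend.preprocess(raw_inputs)
        start = time.perf_counter()
        outputs = backend.predict(batch)
        latency_ms = (time.perf_counter() - start) * 1e3
        return EvaluationResult(outputs, {"latency_ms": latency_ms})

Because pre-processing lives behind the same interface as inference, the harness can also attribute time to each pipeline stage separately, which is exactly the kind of cross-stack visibility discussed above.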

Expected Outcome

The industry sees a need to educate students and professionals in the art of ML benchmarking and analysis. Deep learning is a complex space that requires the optimization of algorithms, software, and hardware stacks. Our goal is that attendees leave the tutorial with a good sense of the ML landscape, an understanding of the state of affairs in ML model benchmarking for conducting research, an appreciation of the research value of ML benchmarking, and a working knowledge of the existing tools for debugging and analyzing the performance of their ML/DL models. Ideally, the tutorial will raise the bar for informed research and spark new ideas.

Related Material

PRIOR EVENTS

Organizers

Tom St. John

Tom St. John is a senior software engineer at Cruise, where he serves as a technical lead for the Future Compute Group. Prior to his current role, he led the distributed machine learning performance optimization efforts within Tesla Autopilot and served as the director of the AI Co-Design Center at Wave Computing. His research primarily focuses on the intersection of parallel programming models and computer architecture design, and the impact that this intersection has on large-scale machine learning. His work has resulted in a number of patents and publications and has been recognized with a HiPEAC 2020 paper award and a best paper award at ADAPT'14.

Carole-Jean Wu

Carole-Jean Wu is a Research Scientist at Facebook AI Research. Her research focus lies in the domain of computer system architecture, with particular emphasis on energy- and memory-efficient systems. Her recent research has pivoted into designing systems for machine learning execution at scale, such as for personalized recommender systems and mobile deployment. In general, she is interested in tackling system challenges to enable efficient, responsible AI execution. Carole-Jean chairs the MLPerf Recommendation Benchmark Advisory Board, co-chaired MLPerf Inference, and serves on the MLCommons Board as a director. Carole-Jean received her M.A. and Ph.D. from Princeton and B.Sc. from Cornell.

Vijay Janapa Reddi

Vijay Janapa Reddi is an Associate Professor in the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Prior to joining Harvard, he was an Associate Professor at The University of Texas at Austin, where he continues to be an Adjunct Professor. He leads the MLPerf Inference benchmarking effort. His research interests include computer architecture, compilers, and runtime systems, specifically in the context of mobile and edge computing systems, with a focus on improving their performance, power efficiency, and reliability.