ML Performance

Benchmarking Deep Learning Systems

To be held along with IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

August 23rd, 2020

About the Tutorial

The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks. It lacks standard tools and methodologies to evaluate and profile models or systems. Due to the absence of standard tools, the state of the practice for evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error-prone — stifling the adoption of the innovations in a rapidly moving field.

The goal of the tutorial is to bring together experts from industry and academia to foster systematic development, reproducible evaluation, and performance analysis of deep learning artifacts. It seeks to address the following questions:

    1. What are the benchmarks that can effectively capture the scope of the ML/DL domain?

    2. Are the existing frameworks sufficient for this purpose?

    3. What are some of the industry-standard evaluation platforms or harnesses?

    4. What are the metrics for carrying out an effective comparative evaluation?



Schedule

      • 08:00 - 08:10 AM Opening (slide deck)

      • 08:10 - 10:20 AM Introduction to ML Benchmarking (DAWNBench, MLPerf, TBD)

        • The DAWN of MLPerf (slide deck)

        • MLPerf Training & Inference (slide deck)

        • The TBD Tools & Benchmarks (slide deck)

      • 10:20 - 10:30 AM Coffee break

      • 10:30 - 12:00 PM Across-stack Analysis of DL Pipelines & Informing DL Optimizations through Benchmarking (slide deck)

Program Overview

The tutorial will cover a range of different topics, including but not limited to the following:

Representative benchmarks for the ML domain

Benchmarks are instrumental to the development of the architecture and systems communities. We will cover the various ongoing efforts in ML benchmarking. More specifically, we will present MLPerf, an ongoing industry-wide effort involving more than 30 companies, to create a SPEC-like benchmark that standardizes how we measure the “training” and “inference” performance of ML models, software frameworks, ML hardware accelerators, and ML cloud and edge platforms. We will discuss the influence of academic benchmarking efforts such as the Fathom benchmark suite from Harvard, the DAWNBench suite from Stanford University, and the TBD suite from the University of Toronto on the initial design of MLPerf. We will also discuss the benchmarks in MLPerf and elaborate on the subtle nuances of developing an ML benchmark for the industry, such as how models are chosen and prepared for benchmarking.
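One metric the academic efforts mentioned above popularized (notably DAWNBench) is time-to-accuracy: train until a target validation accuracy is first reached and report the wall-clock time. A minimal, illustrative sketch of that metric is below; the `train_one_epoch` stub and its synthetic accuracy curve are our own placeholders, not part of any benchmark suite.

```python
import time

# Hypothetical stand-in for one epoch of training: a synthetic
# accuracy curve that improves and then saturates, purely for
# illustration of the metric's structure.
def train_one_epoch(epoch):
    return min(0.95, 0.50 + 0.05 * epoch)

def time_to_accuracy(target, max_epochs=100):
    """Train until validation accuracy first reaches `target`,
    reporting the epochs and wall-clock time taken (the
    DAWNBench-style end-to-end metric)."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        accuracy = train_one_epoch(epoch)
        if accuracy >= target:
            return epoch, time.perf_counter() - start
    raise RuntimeError(f"target accuracy {target} not reached")

epochs, seconds = time_to_accuracy(0.93)
print(f"reached target in {epochs} epochs ({seconds:.4f}s)")
```

Because the metric couples speed and quality, it rewards whole-pipeline optimizations (data loading, learning-rate schedules, hardware) rather than raw throughput alone.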

Challenges presented by existing frameworks

We will explain the common pitfalls in benchmarking ML systems. For instance, a common benchmarking trap is to assume that any two ML frameworks are naturally mathematically equivalent in their implementation of ML models. But no two frameworks (e.g., PyTorch, TensorFlow, MXNet, Caffe2) are truly alike. There are pros and cons to each framework’s implementation, and understanding these subtleties is critical to correctly benchmarking systems and understanding the performance of various ML/DL models. Apart from ML frameworks, other factors play an important role in benchmarking, such as the pre- and post-processing steps that are essential for running the benchmarks inside a framework, other supporting software libraries and their versions needed to compile the framework, and architecture configurations of the underlying computing hardware. We will show that there are some non-obvious intricacies and subtleties that, if not well understood, can lead to "mysterious" inconsistent comparisons. Hence, as systems researchers, we aim to showcase the importance of avoiding common pitfalls when benchmarking ML models across different frameworks, computing hardware, supporting software stacks, and end-to-end pipelines on different datasets.
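As one concrete, framework-free illustration of such pitfalls (this harness is our own sketch, not one of the tools covered in the tutorial): naive wall-clock timing counts one-time warm-up costs (JIT compilation, kernel autotuning, cache population) against steady-state latency, and a raw mean is skewed by OS scheduling outliers.

```python
import statistics
import time

def benchmark(fn, *, warmup=10, iters=100):
    """Measure steady-state latency of `fn`, discarding warm-up runs.

    Warm-up matters because the first calls often pay one-time costs
    that would skew a naive average taken from a cold start.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # Report the median: it is robust to scheduling outliers,
    # unlike the mean of raw wall-clock samples.
    return statistics.median(samples)

latency = benchmark(lambda: sum(range(10_000)))
print(f"median latency: {latency * 1e6:.1f} us")
```

Real DL measurement adds further wrinkles the sketch omits, e.g., asynchronous GPU execution requires explicit device synchronization before reading the clock, or the timer measures only kernel launch overhead.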

Tools and Methodologies for Evaluating platforms

There is a need for evaluation platforms that can enable the evaluation of different hardware platforms, software frameworks, and models, all across cloud and edge systems. We will introduce a set of open-source evaluation platforms that are hardware/software agnostic, extensible, and customizable for evaluating and profiling ML models across datasets, frameworks, and hardware, and within different AI application pipelines. We will demonstrate the set of tools and methodologies built under the TBD Benchmark Suite from the University of Toronto and Microsoft Research, with a key focus on analyzing performance, hardware utilization, memory consumption, and other performance aspects (networking and I/O) related to distributed training. We will also cover an open-source evaluation platform, called MLModelScope, from the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), which lowers the cost and effort of performing model evaluation and profiling, making it easier to reproduce, evaluate, and analyze accuracy, performance, and resilience claims of models, frameworks, and systems. It supports all major frameworks, hundreds of models and datasets, and many major hardware types.
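To give a flavor of one measurement these tools automate, here is a stdlib-only sketch of peak-memory profiling. It is not how the TBD tools or MLModelScope are implemented; Python's `tracemalloc` only sees allocations made through the Python allocator, whereas the real profilers also track GPU and native memory.

```python
import tracemalloc

def profile_peak_memory(fn):
    """Run `fn` and report its peak Python-heap allocation in bytes.

    A deliberately simplified stand-in for the memory-consumption
    analysis that dedicated DL profilers perform across the full
    hardware/software stack.
    """
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

# Allocate a million-element list as a hypothetical "workload".
peak = profile_peak_memory(lambda: [0] * 1_000_000)
print(f"peak allocation: {peak / 1e6:.1f} MB")
```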

Expected Outcome

The industry sees a need to educate students and professionals in the art of ML benchmarking and analysis. Deep learning is a complex space that requires optimization across algorithms, software, and hardware stacks. Our goal is that attendees leave the tutorial with a good sense of the ML landscape, an understanding of the state of affairs in ML model benchmarking for conducting research, an appreciation of the research value of ML benchmarking, and knowledge of the existing tools for debugging and analyzing the performance of their ML/DL models. Ideally, the tutorial raises the bar for informed research and sparks new ideas.

Related Material


    • This is the first time the tutorial is being held!


Organizers

  • Cody Coleman (Stanford), Wen-mei Hwu (University of Illinois Urbana-Champaign), Vijay Janapa Reddi (Harvard University, Google), Gennady Pekhimenko (University of Toronto), Carole-Jean Wu (Facebook, ASU), Jinjun Xiong (IBM Thomas J. Watson Research)


Carole-Jean Wu

Carole-Jean Wu is a Research Scientist at Facebook AI Research. Her research focus lies in the domain of computer system architecture with particular emphasis on energy- and memory-efficient systems. Her recent research has pivoted into designing systems for machine learning execution at-scale, such as for personalized recommender systems and mobile deployment. Carole-Jean chairs the MLPerf Recommendation Benchmark Advisory Board and co-chairs MLPerf Inference.

Carole-Jean holds tenure from ASU and received her M.A. and Ph.D. from Princeton and B.Sc. from Cornell. She is the recipient of the NSF CAREER Award, the Facebook AI Infrastructure Mentorship Award, the IEEE Young Engineer of the Year Award, the Science Foundation Arizona Bisgrove Early Career Scholarship, and the Intel PhD Fellowship, as well as a number of Best Paper awards.

Vijay Reddi

Vijay Janapa Reddi is an Associate Professor in the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Prior to joining Harvard University, he was an Associate Professor at The University of Texas at Austin, where he continues to be an Adjunct Professor. He leads the MLPerf inference benchmarking effort. His research interests include computer architecture, compilers, and runtime systems, specifically in the context of mobile and edge computing systems, to improve their performance, power efficiency, and reliability.

Cody Coleman

Cody Coleman is a fourth-year computer science Ph.D. student at Stanford University, advised by Professors Matei Zaharia and Peter Bailis. His research aims to democratize machine learning by reducing the cost of producing state-of-the-art models and creating novel abstractions that simplify machine learning development and deployment. His recent work spans from performance benchmarking of hardware and software systems (i.e., DAWNBench and MLPerf) to computationally efficient methods for active learning and core-set selection. His Ph.D. has been supported by the NSF GRFP, the Stanford DAWN Project, and the Open Phil AI Fellowship.

Gennady Pekhimenko

Gennady Pekhimenko is an Assistant Professor in the CS department (and, by courtesy, the ECE department) at the University of Toronto, where he leads the EcoSystem (Efficient Computing Systems) group. Gennady is also a Faculty Member at the Vector Institute and a CIFAR AI Chair. Before joining the University of Toronto, he spent a year (2017) in the Systems Research group at Microsoft Research in Redmond. He received his PhD from the Computer Science Department at Carnegie Mellon University in 2016. Gennady is a recipient of the Amazon Machine Learning Research Award, the Facebook Faculty Research Award, the Connaught New Researcher Award, and the NVIDIA Graduate, Microsoft Research, Qualcomm Innovation, and NSERC CGS-D Fellowships. His research interests are in the areas of computer architecture, hardware acceleration, systems for machine learning, and compilers.

Cheng Li

Dr. Cheng Li received her Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign (UIUC), where she was advised by Professor Wen-Mei Hwu. She is joining Microsoft as a senior researcher. Her research lies in the field of GPU-accelerated applications, with an emphasis on Deep Learning. Her recent work has focused on understanding and optimizing Deep Learning workloads. Before UIUC, she received her MS and BS degrees in Computer Science and Engineering from the University of Michigan, and a BS degree in Electrical Engineering from Shanghai Jiao Tong University.

Abdul Dakkak

Abdul Dakkak is a Principal Research Software Engineer at Microsoft Research AI working on next-generation compilers for end-to-end machine learning. Before that, he received his Ph.D. in Computer Science from the University of Illinois at Urbana-Champaign (UIUC). He was a senior compiler developer at Wolfram Research, leading the design and development of the Wolfram Compiler effort for over 6 years. Abdul’s research interest lies at the intersection of machine learning, compilers, programming languages, and accelerated computing. His focus is compiling high-level languages into performant code running on different hardware. In the process, he has developed industry-grade tools for compiling, running, profiling, and introspecting real-world applications to optimize their performance across both the hardware and software stack. As a primary developer of the Wolfram Compiler, Abdul has developed the Wolfram type system and architected the Wolfram runtime. As a result, compiled Wolfram code matches the speed of hand-optimized C code and can target accelerator and multi-node systems.

Aside from the compiler work, Abdul also has been developing MLModelScope, which is a distributed platform allowing people to deploy, profile, and experiment with ML/DL frameworks and models. The tools are used to inform system design for Deep Learning model serving and develop highly tuned GPU kernels for model inference.