ML Performance Benchmarking

A Deep Dive into Deep Learning Benchmarking and Analysis

A tutorial at ASPLOS 2020 on March 17th, 2020.

About the Tutorial

The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks, and it lacks standard tools and methodologies to evaluate and profile models or systems. As a result, the state of the practice for evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error-prone, which stifles the adoption of these innovations in a rapidly moving field.

The goal of the tutorial is to bring experts from industry and academia together to foster systematic development, reproducible evaluation, and performance analysis of deep learning artifacts. It seeks to address the following questions:

    1. What are the benchmarks that can effectively capture the scope of the ML/DL domain?
    2. Are the existing frameworks sufficient for this purpose?
    3. What are some of the industry-standard evaluation platforms or harnesses?
    4. What are the metrics for carrying out an effective comparative evaluation?

Schedule (TBD)

    • 08:00 – 10:00 AM Introduction to MLPerf Training and Inference
    • 10:00 – 10:30 AM Coffee break
    • 10:30 – 12:30 PM Challenges and Pitfalls in Benchmarking ML
    • 12:30 – 02:00 PM Lunch break
    • 02:00 – 03:00 PM MLModelScope Deep Dive
    • 03:00 – 04:00 PM MLModelScope for MLPerf
    • 04:00 – 04:30 PM Coffee break
    • 04:30 – 05:30 PM Tools and Methodologies
    • 05:30 – 06:30 PM Open Issues / Challenges


Room 4

Program Overview

The tutorial will cover a range of different topics, including but not limited to the following:

Representative benchmarks for the ML domain

Benchmarks are instrumental to the development of the architecture and systems communities. We will cover the various ongoing efforts in ML benchmarking. More specifically, we will present MLPerf, an ongoing industry-wide effort involving more than 30 companies to create a SPEC-like benchmark that standardizes how we measure the “training” and “inference” performance of ML models, software frameworks, ML hardware accelerators, and ML cloud and edge platforms. We will discuss the influence of academic benchmarking efforts, such as the Fathom benchmark suite from Harvard, the DAWNBench suite from Stanford University, and the TBD suite from the University of Toronto, on the initial design of MLPerf. We will discuss the benchmarks in MLPerf and also elaborate on the subtleties of developing an ML benchmark for the industry, such as how models are chosen and prepared for benchmarking.

Challenges presented by existing frameworks

We will explain the common pitfalls in benchmarking ML systems. For instance, a common benchmarking trap is to assume that any two ML frameworks are mathematically equivalent in their implementation of ML models. But no two frameworks (e.g., PyTorch, TensorFlow, MXNet, Caffe2) are truly alike. There are pros and cons to each framework’s implementation, and understanding these subtleties is critical to correctly benchmarking systems and understanding the performance of various ML/DL models. Apart from ML frameworks, other factors play an important role in benchmarking, such as the pre- and post-processing steps that are essential for running the benchmarks inside a framework, the supporting software libraries (and their versions) needed to compile the framework, and the architecture configuration of the underlying computing hardware. We will show that there are non-obvious intricacies that, if not well understood, can lead to "mysterious" inconsistent comparisons. Hence, as system researchers, we aim to showcase the importance of avoiding common pitfalls when benchmarking ML models across different frameworks, computing hardware, supporting software stacks, and end-to-end benchmarking pipelines on different datasets.
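As a small illustration of the pre-processing pitfall described above, the sketch below applies two commonly seen ImageNet-style input conventions to the same pixel. The function names are hypothetical and the constants are typical published values used here as assumptions; this is not the pipeline of any particular benchmark or framework discussed in the tutorial.

```python
# Illustrative sketch: the same RGB pixel under two pre-processing conventions.
# Function names are hypothetical; constants are assumed, commonly used values.

def preprocess_scaled_rgb(pixel_rgb):
    """Scale to [0, 1], then normalize each channel (RGB order)."""
    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)
    return tuple((c / 255.0 - m) / s for c, m, s in zip(pixel_rgb, mean, std))


def preprocess_mean_subtracted_bgr(pixel_rgb):
    """Reorder to BGR and subtract a per-channel mean on the 0-255 scale."""
    mean_bgr = (103.939, 116.779, 123.68)
    red, green, blue = pixel_rgb
    return tuple(c - m for c, m in zip((blue, green, red), mean_bgr))


pixel = (200, 100, 50)  # one RGB pixel with values in 0..255
scaled = preprocess_scaled_rgb(pixel)
shifted = preprocess_mean_subtracted_bgr(pixel)

# Largest element-wise gap between the two "equivalent" model inputs.
max_abs_diff = max(abs(x - y) for x, y in zip(scaled, shifted))
print("scaled RGB input:      ", scaled)
print("mean-subtracted BGR:   ", shifted)
print("max absolute difference:", max_abs_diff)
```

A model trained with one convention and evaluated with the other would see inputs on a different scale and in a different channel order, so accuracy comparisons across such setups are not meaningful unless the pre-processing is matched.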

Tools and methodologies for evaluating platforms

There is a need for evaluation platforms that can evaluate different hardware platforms, software frameworks, and models across cloud and edge systems. We will introduce a set of open-source evaluation platforms that are hardware- and software-agnostic, extensible, and customizable, for evaluating and profiling ML models across datasets, frameworks, and hardware, and within different AI application pipelines. We will demonstrate the tools and methodologies built under the TBD Benchmark Suite from the University of Toronto and Microsoft Research, with a key focus on analyzing performance, hardware utilization, memory consumption, and the performance aspects of distributed training (networking and I/O). We will also cover an open-source evaluation platform called MLModelScope, from the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), which lowers the cost and effort of model evaluation and profiling, making it easier to reproduce, evaluate, and analyze accuracy, performance, and resilience claims of models, frameworks, and systems. It supports all major frameworks, hundreds of models, datasets, and many major hardware types.
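To make the performance-analysis part concrete, here is a minimal sketch of the kind of latency measurement that evaluation platforms automate. It is not the API of MLModelScope or the TBD suite; `measure_latency`, `summarize`, and the stand-in workload `fake_inference` are hypothetical names introduced only for illustration.

```python
import math
import statistics
import time


def measure_latency(fn, warmup=5, runs=30):
    """Return per-call latencies in milliseconds, excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # discarded: absorbs one-time costs such as lazy initialization
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms):
    """Report mean, median, nearest-rank 90th percentile, and throughput."""
    ordered = sorted(latencies_ms)
    p90_index = max(0, math.ceil(0.9 * len(ordered)) - 1)
    mean_ms = statistics.mean(ordered)
    return {
        "mean_ms": mean_ms,
        "p50_ms": statistics.median(ordered),
        "p90_ms": ordered[p90_index],
        "calls_per_s": 1000.0 / mean_ms,
    }


def fake_inference():
    """Stand-in for a model call (assumption: any zero-argument callable)."""
    return sum(i * i for i in range(10_000))


latencies = measure_latency(fake_inference)
summary = summarize(latencies)
print(summary)
```

Reporting a tail percentile alongside the mean, and discarding warm-up iterations, are two simple choices that make numbers from different systems easier to compare.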

Expected Outcome

The industry sees a need to educate students and professionals in the art of ML benchmarking and analysis. Deep learning is a complex space that requires optimization of algorithms, software, and hardware stacks. Our goal is that attendees leave the tutorial with a good sense of the ML landscape, an understanding of the state of affairs in ML model benchmarking for conducting research, an appreciation of the research value of ML benchmarking, and familiarity with the existing tools for debugging and analyzing the performance of their ML/DL models. Ideally, the tutorial raises the bar for informed research and sparks new ideas.

Related Material


    • Peter Mattson (Google)
    • Vijay Janapa Reddi (Harvard University)
    • Gennady Pekhimenko (University of Toronto)
    • Jinjun Xiong (IBM)
    • Hongyu Zhu (University of Toronto)
    • Geoffrey Yu (University of Toronto)
    • Abdul Dakkak (UIUC)
    • Cheng Li (UIUC)
    • Carole-Jean Wu (Facebook/ASU)



  • Vijay Janapa Reddi (Harvard University), Wen-mei Hwu (UIUC), Jinjun Xiong (IBM), Gennady Pekhimenko (University of Toronto), Karthik V Swaminathan (IBM), Cody Coleman (Stanford), Carole-Jean Wu (Facebook/ASU)

Presenter Biography

Vijay Janapa Reddi

Vijay Janapa Reddi is an Associate Professor in the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Prior to joining Harvard University, he was an Associate Professor at The University of Texas at Austin, where he continues to be an Adjunct Professor. He leads the MLPerf inference benchmarking effort. His research interests include computer architecture, compilers, and runtime systems, specifically in the context of mobile and edge computing systems to improve their performance, power efficiency, and reliability.

Jinjun Xiong

Dr. Jinjun Xiong is a Researcher and Program Director for Cognitive Computing Systems Research at the IBM Thomas J. Watson Research Center. He co-directs the IBM-Illinois Center for Cognitive Computing Systems Research, where he conducts cutting-edge AI research with a group of talented students and faculty members. He has published over 100 peer-reviewed international conference and journal papers, spanning computer vision, natural language processing, deep neural network acceleration, AI systems and solutions, and VLSI systems and design automation. His publications have won five Best Paper Awards and eight nominations for Best Paper Awards. He has also led teams to win various international research competitions, including double championships at the DAC’19 Systems Design Contest, first place at the CVPR’18 Looking-into-Person Challenge, and the student innovation award at the HPEC’18 GraphChallenge.

Gennady Pekhimenko

Gennady Pekhimenko is an Assistant Professor at the University of Toronto, CS department and (by courtesy) ECE department, where he leads the EcoSystem (Efficient Computing Systems) group. He received his PhD from the Computer Science Department at Carnegie Mellon University under the supervision of Professor Todd C. Mowry and Professor Onur Mutlu. Gennady was a recipient of NVIDIA Graduate, Microsoft Research, Qualcomm Innovation, and NSERC CGS-D Fellowships. His research interests are in efficient memory hierarchy designs, hardware acceleration, systems for machine learning, GPUs, and compilers.

Hongyu Zhu

Hongyu Zhu is a fourth-year PhD student, supervised by Prof. Gennady Pekhimenko at the University of Toronto. His research focuses mainly on profiling, analyzing, and optimizing the performance of deep learning workloads at the system level. He is also currently working with Amar Phanishayee from Microsoft Research, focusing on performance analysis and prediction for large-scale DNN training.

Geoffrey Yu

Geoffrey Yu is a research stream master’s student in the Department of Computer Science at the University of Toronto and is supervised by Professor Gennady Pekhimenko. Geoffrey received his Bachelor’s of Software Engineering (BSE) from the University of Waterloo in 2018. Geoffrey is a recipient of the Snap Research Scholarship, Vector Institute Scholarship in Artificial Intelligence, NSERC CGS-M Scholarship, and the Queen Elizabeth II Graduate Scholarship. His research interests are in distributed systems, large-scale compute and data systems, and systems for machine learning.

Abdul Dakkak

Abdul Dakkak is a PhD candidate in Computer Science at the University of Illinois at Urbana-Champaign (UIUC) and a senior compiler developer at Wolfram Research. He is a research assistant in the IMPACT Research Group working with Professor Wen-mei Hwu. Abdul’s research interests are in functional programming languages and how they compile to accelerate single and multi-node systems. To this end, Abdul is a primary developer on the Wolfram compiler. Abdul has been actively involved in teaching activities. He has aided in teaching the Coursera HPP course (3 times), the introductory and advanced CUDA courses (2 times), and was involved in the PUMPS summer school at BSC (4 times). Abdul developed tools to enable teaching for large classrooms and is the author of WebGPU and RAI. Recently Abdul has been developing MLModelScope, which is a distributed platform allowing people to easily deploy, profile, and experiment with ML/DL frameworks and models.

Cheng Li

Cheng Li is a PhD candidate in Computer Science at the University of Illinois at Urbana-Champaign (UIUC) and a member of the IMPACT Research Group led by Professor Wen-mei Hwu. Her research lies in the field of GPU-accelerated applications, with an emphasis on Deep Learning. Her work has focused on understanding, characterizing, and optimizing Deep Learning workloads. In the process, she has developed a number of open-source tools to benchmark, profile, and summarize Deep Learning training and inference across hardware and software stacks. The tools have been used to inform system design for Deep Learning model serving and to develop highly tuned GPU kernels for model inference.

Lingyi Liu

Lingyi Liu has been a Research Scientist at Facebook since 2016, where he works mostly on performance optimization for deep learning models, frameworks, and systems in the AI Infra team. Prior to that, he worked on compilers for FPGA-based emulation systems at Synopsys for three years, where he invented a systematic formal method that greatly improved emulation system performance. He obtained his PhD in computer engineering from the University of Illinois at Urbana-Champaign in 2013. His PhD research was on how to harmonize static analysis and machine learning techniques for system verification, and he co-developed several advanced algorithms and widely used tools for hardware verification.

Zhizhen Qin

Zhizhen Qin is a Research Assistant studying reinforcement learning at the University of California, San Diego, where he obtained his bachelor’s degree in December 2018. He has done two internships with Facebook: one with AI Infra's Caffe2 team in 2019, and the other with AML's computer vision team in 2018. He created the visualization feature for the OSS Facebook AI-PEP.

Carole-Jean Wu

Carole-Jean Wu is a Research Scientist at Facebook’s AI Infrastructure Research. She is also a tenured Associate Professor of CSE at Arizona State University. Carole-Jean’s research focuses on computer and system architectures. More recently, her research has pivoted to designing and optimizing systems for machine learning. She is the lead author of “Machine Learning at Facebook: Understanding Inference at the Edge”, which presents the unique design challenges faced when deploying ML solutions at scale to the edge, from billions of smartphones to Facebook’s virtual reality platforms. She co-chairs the MLPerf Inference working group. Carole-Jean received her M.A. and Ph.D. from Princeton and her B.Sc. from Cornell.