ML Performance Benchmarking
A Deep Dive into Deep Learning Benchmarking and Analysis
A tutorial at ASPLOS 2020 on March 17, 2020.
The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks. It lacks standard tools and methodologies to evaluate and profile models or systems. Due to the absence of standard tools, the state of the practice for evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error-prone — stifling the adoption of the innovations in a rapidly moving field.
The goal of the tutorial is to bring together experts from industry and academia to foster the systematic development, reproducible evaluation, and performance analysis of deep learning artifacts.
The tutorial will cover a range of different topics, including but not limited to the following:
Benchmarks are instrumental to the development of the architecture and systems communities. We will cover the various ongoing efforts in ML benchmarking. More specifically, we will present MLPerf (mlperf.org), an ongoing industry-wide effort, involving more than 30 companies, to create a SPEC-like benchmark to standardize how we measure the "training" and "inference" performance of ML models, software frameworks, ML hardware accelerators, and ML cloud and edge platforms. We will discuss the influence of academic benchmarking efforts such as the Fathom benchmark suite from Harvard, the DAWNBench suite from Stanford University, and the TBD suite from the University of Toronto on the initial design of MLPerf. We will discuss the benchmarks in MLPerf and also elaborate on the subtle nuances of developing an ML benchmark for the industry, such as how models are chosen and prepared for benchmarking.
We will explain common pitfalls in benchmarking ML systems. For instance, a common benchmarking trap is to assume that any two ML frameworks are naturally mathematically equivalent in their implementation of ML models. But no two frameworks (e.g., PyTorch, TensorFlow, MXNet, Caffe2) are truly alike. There are pros and cons to each framework's implementation, and understanding these subtleties is critical to correctly benchmarking systems and understanding the performance of various ML/DL models. Apart from ML frameworks, other factors play an important role in benchmarking, such as the pre- and post-processing steps that are essential for running the benchmarks inside a framework, the supporting software libraries and their versions needed to compile the framework, and the architecture configuration of the underlying computing hardware. We will show that there are non-obvious intricacies and subtleties that, if not well understood, can lead to "mysterious" inconsistent comparisons. Hence, as systems researchers, we will teach and showcase how to avoid common pitfalls when benchmarking ML models across different frameworks, computing hardware, supporting software stacks, and end-to-end benchmarking pipelines on different datasets.
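To make one of these pitfalls concrete, here is a minimal sketch (our own illustration, not taken from any of the benchmark suites above; it assumes PyTorch, torchvision, and a CUDA-capable GPU are installed) showing how naive timing of GPU inference, without warm-up and without synchronizing on the asynchronous CUDA stream, can produce misleading latency numbers:

```python
# Minimal sketch (assumes PyTorch, torchvision, and a CUDA GPU are available).
# It illustrates two common pitfalls: skipping warm-up iterations and timing
# asynchronous GPU work without synchronizing first.
import time

import torch
import torchvision.models as models

model = models.resnet50().eval().cuda()
batch = torch.randn(1, 3, 224, 224, device="cuda")

with torch.no_grad():
    # Naive timing: the first call pays one-time costs (memory allocation,
    # cuDNN algorithm selection), and kernel launches are asynchronous, so
    # the host-side timer may not reflect actual GPU execution time.
    start = time.perf_counter()
    model(batch)
    naive_ms = (time.perf_counter() - start) * 1e3

    # More careful timing: warm up, then synchronize around the measured region.
    for _ in range(10):
        model(batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(batch)
    torch.cuda.synchronize()
    careful_ms = (time.perf_counter() - start) * 1e3 / 100

print(f"naive: {naive_ms:.2f} ms/batch, careful: {careful_ms:.2f} ms/batch")
```

The same caution applies to the surrounding pipeline: two frameworks that resize, crop, or normalize inputs differently will report different accuracy and latency for what is nominally the same model.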
There is a need for evaluation platforms that enable the evaluation of different hardware platforms, software frameworks, and models across cloud and edge systems. We will introduce a set of open-source evaluation platforms that are hardware/software agnostic, extensible, and customizable for evaluating and profiling ML models across datasets, frameworks, and hardware, and within different AI application pipelines. We will demonstrate the tools and methodologies built under the TBD Benchmark Suite from the University of Toronto and Microsoft Research, with a key focus on analyzing performance, hardware utilization, memory consumption, and performance aspects (networking and I/O) related to distributed training. We will also cover an open-source evaluation platform, called MLModelScope, from the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), which lowers the cost and effort of performing model evaluation and profiling, making it easier to reproduce, evaluate, and analyze accuracy, performance, and resilience claims of models, frameworks, and systems. All major frameworks, hundreds of models and datasets, and many major hardware types are available within it.
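As a taste of the kind of analysis these platforms automate, the hedged sketch below uses PyTorch's built-in autograd profiler (chosen here only as a familiar framework-level example; it is not MLModelScope or the TBD tooling) to break down operator time and peak GPU memory for a single training step:

```python
# Hedged sketch (assumes PyTorch, torchvision, and a CUDA GPU). It uses only
# PyTorch's built-in autograd profiler to break down per-operator time and
# report peak GPU memory for one training step of ResNet-50.
import torch
import torchvision.models as models

model = models.resnet50().cuda()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(32, 3, 224, 224, device="cuda")
labels = torch.randint(0, 1000, (32,), device="cuda")

with torch.autograd.profiler.profile(use_cuda=True) as prof:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

# Per-operator breakdown of CPU and GPU time for the step.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# Peak GPU memory allocated during the step, in bytes.
print(torch.cuda.max_memory_allocated())
```

Platforms such as the TBD suite and MLModelScope extend this kind of measurement across frameworks, hardware, and full application pipelines, and add the bookkeeping needed to make the results reproducible and comparable.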
The industry sees a need to educate students and professionals in the art of ML benchmarking and analysis. Deep learning is a complex space that requires optimization across algorithms, software, and hardware stacks. Our goal is that attendees leave the tutorial with a good sense of the ML landscape, an understanding of the state of affairs in ML model benchmarking for conducting research, an appreciation of the research value of ML benchmarking, and a working knowledge of existing tools to debug and analyze the performance of their ML/DL models. Ideally, the tutorial raises the bar for informed research and sparks new ideas.
Vijay Janapa Reddi is an Associate Professor in the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Prior to joining Harvard, he was an Associate Professor at The University of Texas at Austin, where he continues to be an Adjunct Professor. He leads the MLPerf inference benchmarking effort. His research interests include computer architecture, compilers, and runtime systems, specifically in the context of mobile and edge computing systems, to improve their performance, power efficiency, and reliability.
Dr. Jinjun Xiong is a Researcher and Program Director for Cognitive Computing Systems Research at the IBM Thomas J. Watson Research Center. He co-directs the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR.com), where he conducts cutting-edge AI research with a group of talented students and faculty members. He has published over 100 peer-reviewed international conference and journal papers, spanning computer vision, natural language processing, deep neural network acceleration, AI systems and solutions, VLSI systems, and design automation. His publications have won five Best Paper Awards and eight Best Paper Award nominations. He has also led teams to win various international research competitions, including double championships at the DAC'19 Systems Design Contest, first place at the CVPR'18 Looking-into-Person Challenge, and a student innovation award at the HPEC'18 GraphChallenge.
Wen-mei W. Hwu is a Professor and the Sanders-AMD Endowed Chair in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. He directs the IBM-Illinois Center for Cognitive Computing Systems Research and the IMPACT research group (www.impact.crhc.illinois.edu), where tools like MLModelScope are being developed. He has given numerous tutorials and has won several major teaching awards. He received the ACM SIGARCH Maurice Wilkes Award, the ACM Grace Murray Hopper Award, the IEEE Computer Society Charles Babbage Award, the ISCA Influential Paper Award, the MICRO Test-of-Time Paper Award, the IEEE Computer Society B. R. Rau Award, and the Distinguished Alumni Award in Computer Science of the University of California, Berkeley. He is a Fellow of the IEEE and the ACM.
Gennady Pekhimenko is an Assistant Professor in the Computer Science department (and, by courtesy, the ECE department) at the University of Toronto, where he leads the EcoSystem (Efficient Computing Systems) group. He received his PhD from the Computer Science Department at Carnegie Mellon University under the supervision of Professor Todd C. Mowry and Professor Onur Mutlu. Gennady was a recipient of the NVIDIA Graduate, Microsoft Research, Qualcomm Innovation, and NSERC CGS-D Fellowships. His research interests are in efficient memory hierarchy designs, hardware acceleration, systems for machine learning, GPUs, and compilers.
Hongyu Zhu is a 4th-year PhD student supervised by Prof. Gennady Pekhimenko at the University of Toronto. His research focuses mainly on profiling, analyzing, and optimizing deep learning workloads at the system level. He is also currently working with Amar Phanishayee from Microsoft Research on performance analysis and prediction for large-scale DNN training.
Geoffrey Yu is a research-stream master's student in the Department of Computer Science at the University of Toronto, supervised by Professor Gennady Pekhimenko. Geoffrey received his Bachelor of Software Engineering (BSE) from the University of Waterloo in 2018. He is a recipient of the Snap Research Scholarship, the Vector Institute Scholarship in Artificial Intelligence, the NSERC CGS-M Scholarship, and the Queen Elizabeth II Graduate Scholarship. His research interests are in distributed systems, large-scale compute and data systems, and systems for machine learning.
Abdul Dakkak is a PhD candidate in Computer Science at the University of Illinois at Urbana-Champaign (UIUC) and a senior compiler developer at Wolfram Research. He is a research assistant in the IMPACT Research Group working with Professor Wen-mei Hwu. Abdul's research interests are in functional programming languages and how they can be compiled for accelerated single- and multi-node systems. To this end, Abdul is a primary developer of the Wolfram compiler. Abdul has been actively involved in teaching: he has helped teach the Coursera HPP course (3 times) and the introductory and advanced CUDA courses (2 times), and has been involved in the PUMPS summer school at BSC (4 times). Abdul developed tools to enable teaching for large classrooms and is the author of WebGPU and RAI. Recently, Abdul has been developing MLModelScope, a distributed platform that allows people to easily deploy, profile, and experiment with ML/DL frameworks and models.
Cheng Li is a PhD candidate in Computer Science at the University of Illinois at Urbana-Champaign (UIUC) and a member of the IMPACT Research Group led by Professor Wen-mei Hwu. Her research lies in the field of GPU-accelerated applications, with an emphasis on deep learning. Her work has focused on understanding, characterizing, and optimizing deep learning workloads. In the process, she has developed a number of open-source tools to benchmark, profile, and summarize deep learning training and inference across hardware and software stacks. These tools have been used to inform system design for deep learning model serving and to develop highly tuned GPU kernels for model inference. Learn more at https://cli99.netlify.com/.
Lingyi Liu has been a Research Scientist at Facebook since 2016, where he mostly works on performance optimization for deep learning models, frameworks, and systems on the AI Infra team. Prior to that, he worked for three years on the compiler for an FPGA-based emulation system at Synopsys, where he invented a systematic formal method that greatly improved emulation system performance. He obtained his PhD in computer engineering from the University of Illinois at Urbana-Champaign in 2013. His PhD research focused on harmonizing static analysis and machine learning techniques for system verification, and he co-developed several advanced algorithms and widely used tools for hardware verification.
Zhizhen Qin is a Research Assistant studying reinforcement learning at the University of California, San Diego, where he obtained his bachelor's degree in December 2018. He has done two internships at Facebook: one on the AI Infra Caffe2 team in 2019, and the other on the AML computer vision team in 2018. He created the visualization feature for the OSS Facebook AI-PEP.
Carole-Jean Wu is a Research Scientist at Facebook's AI Infrastructure Research. She is also a tenured Associate Professor of CSE at Arizona State University. Carole-Jean's research focuses on computer and system architecture. More recently, her research has pivoted to designing and optimizing systems for machine learning. She is the lead author of "Machine Learning at Facebook: Understanding Inference at the Edge," which presents the unique design challenges faced when deploying ML solutions at scale to the edge, from billions of smartphones to Facebook's virtual reality platforms. She co-chairs the MLPerf Inference working group. Carole-Jean received her M.A. and Ph.D. from Princeton and her B.Sc. from Cornell.
Peter Mattson leads the ML Performance Measurement team at Google Brain. He is a co-founder of MLPerf and serves as the General Chair. Previously, he founded the Programming Systems and Applications Group at NVIDIA Research, was VP of Software Infrastructure for Stream Processors Inc (SPI), and was a managing engineer at Reservoir Labs. His research focus is on measuring and understanding the performance and quality of machine learning systems. Peter holds a PhD and MS from Stanford University and a BS from the University of Washington.