ML Performance

 Benchmarking Deep Learning Systems

To be held along with the 30th IEEE International Symposium on High-Performance Computer Architecture (HPCA'24)

March 2, 2024

About the Tutorial

The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform models, frameworks, and system stacks, and it lacks standard tools and methodologies to evaluate and profile models or systems. In the absence of such tools, the state of practice for evaluating and comparing the benefits of proposed AI innovations (whether in hardware or software) on end-to-end AI pipelines is both arduous and error-prone, stifling the adoption of these innovations in a rapidly moving field.

The goal of the tutorial is to bring together experts from industry and academia to foster the systematic development, reproducible evaluation, and performance analysis of deep learning artifacts. The topics it addresses are outlined in the program overview below.

REGISTER HERE

www.hpca-conf.org/2024/attend/register.php 


Schedule 



Program Overview 

The tutorial will cover a range of different topics, including but not limited to the following:

Representative benchmarks for the ML domain

Benchmarks are instrumental to the development of the architecture and systems communities. We will cover the various ongoing efforts in ML benchmarking. More specifically, we will present MLPerf (mlperf.org), an ongoing industry-wide effort involving more than 30 companies to create a SPEC-like benchmark that standardizes how we measure the “training” and “inference” performance of ML models, software frameworks, ML hardware accelerators, and ML cloud and edge platforms. We will discuss the influence of academic benchmarking efforts such as the Fathom benchmark suite from Harvard, the DAWNBench suite from Stanford University, and the TBD suite from the University of Toronto on the initial design of MLPerf. We will discuss the benchmarks in MLPerf and also elaborate on the subtle nuances of developing an ML benchmark for industry, such as how models are chosen and prepared for benchmarking.
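
To make the notion of “inference performance” concrete, here is a minimal timing sketch in PyTorch, assuming torch and torchvision are installed; the model choice, batch size, and iteration counts are illustrative only and do not follow MLPerf’s run rules.

```python
import time
import statistics

import torch
import torchvision.models as models

# Illustrative model and batch size; MLPerf prescribes its own models,
# datasets, and run rules -- this only mimics the basic timing loop.
model = models.resnet50(weights=None).eval()
batch = torch.randn(8, 3, 224, 224)

with torch.no_grad():
    # Warm-up iterations so one-time costs (allocation, caching)
    # do not pollute the measurement.
    for _ in range(5):
        model(batch)

    latencies = []
    for _ in range(20):
        start = time.perf_counter()
        model(batch)
        latencies.append(time.perf_counter() - start)

median = statistics.median(latencies)
print(f"median latency: {median * 1e3:.1f} ms")
print(f"throughput: {batch.shape[0] / median:.1f} images/s")
```

Even a loop this simple already embodies benchmarking decisions (warm-up count, batch size, which statistic to report) that a standard such as MLPerf has to pin down.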

Challenges presented by existing frameworks

We will explain the common pitfalls in benchmarking ML systems. For instance, a common benchmarking trap is to assume that any two ML frameworks are naturally mathematically equivalent in their implementation of ML models. But no two frameworks (e.g., PyTorch, TensorFlow, MXNet, Caffe2) are truly alike. There are pros and cons to each framework’s implementation, and understanding these subtleties is critical to correctly benchmarking systems and understanding the performance of various ML/DL models. Apart from ML frameworks, other factors play an important role in benchmarking, such as the pre- and post-processing steps that are essential for running the benchmarks inside a framework, the supporting software libraries and the versions needed to compile the framework, and the architecture configuration of the underlying computing hardware. We will show that there are non-obvious intricacies and subtleties that, if not well understood, can lead to "mysterious" inconsistent comparisons. Hence, as systems researchers, we aim to showcase how to avoid common pitfalls when benchmarking ML models across different frameworks, computing hardware, supporting software stacks, and end-to-end pipelines on different datasets.
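
As a toy illustration of how two mathematically equivalent formulations can diverge in practice, the NumPy sketch below compares a textbook softmax with the numerically stable variant that frameworks typically implement; it is a standalone example and is not tied to any particular framework pair named above.

```python
import numpy as np

def softmax_naive(x):
    # Textbook definition: exp(x_i) / sum_j exp(x_j).
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # The same function mathematically, but shifted by max(x) before
    # exponentiating -- the form most implementations actually use.
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0], dtype=np.float32)

print(softmax_naive(logits))   # overflows in float32 and prints [nan nan nan]
print(softmax_stable(logits))  # well-behaved probabilities
```

Differences like this, compounded across preprocessing, kernels, and reduction orders, are one reason two frameworks running the “same” model can produce different numbers and different performance profiles.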

Tools and methodologies for evaluating platforms

There is a need for evaluation platforms that enable the evaluation of different hardware platforms, software frameworks, and models across cloud and edge systems. We will introduce a set of open-source evaluation platforms that are hardware/software agnostic, extensible, and customizable for evaluating and profiling ML models across datasets, frameworks, and hardware, and within different AI application pipelines. We will demonstrate the tools and methodologies built under the TBD Benchmark Suite from the University of Toronto and Microsoft Research, with a key focus on analyzing performance, hardware utilization, memory consumption, and other performance aspects (networking and I/O) related to distributed training. We will also cover an open-source evaluation platform, called MLModelScope, from the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR), which lowers the cost and effort of performing model evaluation and profiling, making it easier to reproduce, evaluate, and analyze the accuracy, performance, and resilience claims of models, frameworks, and systems. It supports all major frameworks, hundreds of models and datasets, and many major hardware types.
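
To give a flavor of the per-stage measurements such tools automate, here is a minimal, framework-free sketch of a stage profiler; the profile_stage helper and the stage names are hypothetical and are not the actual API of TBD or MLModelScope.

```python
import time
import tracemalloc
from contextlib import contextmanager

@contextmanager
def profile_stage(name):
    # Illustrative helper: records wall-clock time and peak Python heap
    # usage for one pipeline stage. Real tools also capture GPU utilization,
    # I/O, and framework-level traces.
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"{name}: {elapsed * 1e3:.1f} ms, peak {peak / 1e6:.1f} MB")

# Hypothetical end-to-end pipeline stages standing in for real
# preprocessing, model inference, and postprocessing steps.
with profile_stage("preprocess"):
    data = [float(i) for i in range(1_000_000)]

with profile_stage("inference"):
    result = sum(x * x for x in data)

with profile_stage("postprocess"):
    _ = f"{result:.2f}"
```

Breaking measurements out by stage in this way is what makes it possible to attribute end-to-end slowdowns to data loading, preprocessing, or the model itself rather than to the system as a whole.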

Expected Outcome

The industry sees a need to educate students and professionals in the art of ML benchmarking and analysis. Deep learning is a complex space that requires optimization of algorithms, software, and hardware stacks. Our goal is that attendees leave the tutorial with a good sense of the ML landscape, an understanding of the state of affairs in ML model benchmarking for conducting research, an appreciation of the research value of ML benchmarking, and familiarity with existing tools for debugging and analyzing the performance of their ML/DL models. Ideally, the tutorial raises the bar for informed research and sparks new ideas.


Organizers

Tom St. John

Tom St. John is a software engineer at Meta AI where he serves as technical lead for MTIA training performance within the PyTorch AI Acceleration division.  He also serves as the chair of the MLPerf automotive advisory board.  Prior to his current role, he served as a technical lead for the Compute Platforms Group at Cruise and led the distributed machine learning performance optimization efforts within Tesla Autopilot.  His research primarily focuses on the intersection of parallel programming models and computer architecture design, and the impact that this has on large-scale machine learning.

Carole-Jean Wu

Carole-Jean Wu is a Director of AI Research at Meta. She is a founding member and a Vice President of MLCommons, a non-profit organization that aims to accelerate machine learning innovations for the benefit of all. Dr. Wu also serves on the MLCommons Board as a Director, chaired the MLPerf Recommendation Benchmark Advisory Board, and co-chaired MLPerf Inference. Prior to Meta/Facebook, Dr. Wu was a professor at ASU. Dr. Wu’s expertise sits at the intersection of computer architecture and machine learning. Her work spans datacenter infrastructures and edge systems, such as developing energy- and memory-efficient systems and microarchitectures, optimizing systems for machine learning execution at scale, and designing learning-based approaches for system design and optimization. She is passionate about pathfinding and tackling system challenges to enable efficient and responsible AI technologies.

Vijay Janapa Reddi

Vijay Janapa Reddi is an Associate Professor in the John A. Paulson School of Engineering and Applied Sciences at Harvard University. Prior to joining Harvard University, he was an Associate Professor at The University of Texas at Austin, where he continues to be an Adjunct Professor. He leads the MLPerf inference benchmarking effort. His research interests include computer architecture, compilers, and runtime systems, specifically in the context of mobile and edge computing systems, to improve their performance, power efficiency, and reliability.