A Technology/Architecture/Algorithm Co-design Framework for Distributed Training
(June 18, 2021 in conjunction with ISCA 2021)
(12:00 pm - 3:00 pm ET)
Recent studies show that the compute requirement of deep learning applications is doubling every 3 months. Compare this to Moore's law, where the number of transistors per chip doubles only every 3 years: at these rates, compute demand grows by roughly 4,000x over three years while per-chip transistor counts merely double. The only path forward is to build scale-out, multi-chip solutions for training deep learning networks, increasing not only the computational and memory capacity per AI accelerator chip but also the scale of the system.
From an algorithmic standpoint, this implies exploiting all forms of parallelism -- data, model, pipeline, and hybrid parallelism, to name a few. State-of-the-art deep learning problems are already being parallelized across thousands of GPUs and/or TPUs. However, these aggressive scaling attempts come at the cost of severe under-utilization for many state-of-the-art machine learning workloads -- 20% efficiency across 400 GPUs and 6% efficiency across 1000 GPUs.
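To make two of these strategies concrete, the short sketch below (illustrative only; plain NumPy with made-up shapes) shows how a single weight-matrix multiply can be partitioned across workers either along the batch dimension (data parallelism) or along the output-feature dimension (model parallelism).

    import numpy as np

    # Hypothetical sizes, for illustration only.
    batch, d_in, d_out, n_workers = 8, 4, 6, 2

    x = np.random.randn(batch, d_in)   # activations
    w = np.random.randn(d_in, d_out)   # weight matrix
    reference = x @ w                  # single-device result

    # Data parallelism: every worker holds the full weight matrix and
    # processes a slice of the batch.
    data_parallel = np.concatenate(
        [x_shard @ w for x_shard in np.split(x, n_workers, axis=0)], axis=0)

    # Model parallelism: every worker holds a column slice of the weights
    # and sees the full batch; partial outputs are concatenated along features.
    model_parallel = np.concatenate(
        [x @ w_shard for w_shard in np.split(w, n_workers, axis=1)], axis=1)

    assert np.allclose(reference, data_parallel)
    assert np.allclose(reference, model_parallel)

In a real training run, the choice determines what has to cross the inter-accelerator network -- weight gradients for data parallelism, activations for model parallelism -- which is exactly the interplay between parallelization strategy and system design that this tutorial targets.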
Designing a multi-accelerator system is more challenging than designing a single accelerator. First, the design space grows, spanning both per-accelerator architectural parameters and inter-accelerator parameters. Second, there is a complex interplay between the two groups of features. Third, the inter-accelerator design choices are often dictated by the parallelism strategies exposed at the algorithmic level, while the best parallelism strategy is itself dictated by the underlying system design. Existing approaches focus on either the per-accelerator design or the inter-accelerator design: a large body of work designs the best accelerator for a specific set of applications or domains, while other researchers propose methodologies for mapping their ever-growing applications onto multi-accelerator systems, assuming the underlying system is given and unchangeable.
Designing a scale-out system for such large-scale machine learning training problems requires careful co-optimization of the accelerator architecture, the memory subsystem, the inter-chip network, and the algorithmic parallelization approach. Though there have been efforts to standardize the benchmarking of machine learning hardware, what is needed is "full-stack" pathfinding for accelerators. This is all the more critical given that machine learning has emerged as the primary driving workload for future algorithms, architectures, circuits, and semiconductor technology.
To enable such analysis, we have developed a modeling and pathfinding tool, DeepFlow, which captures the interplay between technology parameters (e.g., energy per flop, energy per bit of access at each level of the memory hierarchy), AI accelerator parameters (e.g., compute throughput, memory bandwidth, memory capacity), cross-accelerator parameters (e.g., network bandwidth and network topology), model architecture parameters (e.g., computation graph, width, depth, sequence length), parallelism strategy (model, data, or hybrid parallelism), and power budget. It takes these as input and predicts performance (time-to-train) as output. In addition to accurately modeling technology, hardware, and algorithm behavior, DeepFlow includes an optimization and search engine for exploring the combined design space.
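To give a feel for these inputs and the output, here is a deliberately simplified sketch of the kind of specification such a tool consumes, together with a toy roofline-style step-time estimate. The parameter names, the numbers, and the estimator itself are illustrative assumptions, not DeepFlow's actual interface or analytical model.

    # Hypothetical input specification mirroring the categories listed above.
    config = {
        "technology":  {"energy_per_flop_pj": 0.5, "energy_per_dram_bit_pj": 6.0},
        "accelerator": {"throughput_tflops": 125, "mem_bw_gb_s": 900, "mem_cap_gb": 32},
        "network":     {"link_bw_gb_s": 150, "topology": "2d_torus"},
        "model":       {"width": 8192, "depth": 96, "seq_len": 2048, "batch": 1024},
        "parallelism": {"data": 8, "model": 4},
        "power_budget_w": 300,
    }

    def step_time(cfg, flops_per_step, bytes_per_step, comm_bytes_per_step):
        """Toy roofline-style estimate: each term is the time if that resource
        were the only bottleneck; the step time is the largest of the three."""
        acc, net = cfg["accelerator"], cfg["network"]
        n_devices = cfg["parallelism"]["data"] * cfg["parallelism"]["model"]
        compute_s = flops_per_step / (acc["throughput_tflops"] * 1e12 * n_devices)
        memory_s  = bytes_per_step / (acc["mem_bw_gb_s"] * 1e9 * n_devices)
        network_s = comm_bytes_per_step / (net["link_bw_gb_s"] * 1e9)
        return max(compute_s, memory_s, network_s)

    # Hypothetical per-step totals, hard-coded here for illustration.
    print(step_time(config, flops_per_step=1e15, bytes_per_step=2e13,
                    comm_bytes_per_step=5e11))

A real pathfinding tool derives the per-step flop, byte, and communication counts from the computation graph and the chosen parallelism strategy, and folds the technology parameters and power budget into the achievable throughput and bandwidth; the toy estimator above ignores those terms.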
Background
uArchitecture Generator Engine
Device Placement Engine
Performance Prediction Engine
Optimization and Search Engine
Performance Scaling Model
Validation Results
How to use DeepFlow to predict the performance of your desired model on your desired hardware accelerator, in a distributed training setting, with your desired parallelism strategy?
How to use DeepFlow to do bottleneck analysis?
How to use DeepFlow to explore different technology scaling scenarios?
How to use DeepFlow to explore different parallelism strategies? (see the sketch after this list)
How to use DeepFlow to explore different hardware configurations?
How to use DeepFlow to co-design hardware, parallelism strategy and technology?
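As a flavor of the parallelism-exploration item above, the self-contained toy sweep below enumerates (data, model) splits for a fixed device count and picks the one with the lowest estimated step time. The cost model is a made-up stand-in, not DeepFlow's; the tutorial walks through the real workflow.

    from itertools import product

    N_DEVICES = 32  # hypothetical accelerator count

    def step_time(data_par, model_par):
        # Made-up cost model: compute shrinks with more devices, gradient
        # all-reduce cost grows with data-parallel width, and activation
        # exchange cost grows with model-parallel width.
        compute      = 4.0 / (data_par * model_par)
        grad_sync    = 0.02 * data_par
        act_exchange = 0.05 * model_par
        return compute + grad_sync + act_exchange

    splits = [(d, m) for d, m in product(range(1, N_DEVICES + 1), repeat=2)
              if d * m == N_DEVICES]
    best = min(splits, key=lambda dm: step_time(*dm))
    print("best (data, model) split:", best,
          "estimated step time:", round(step_time(*best), 3), "s")

With this particular cost model, the sweep lands on a hybrid split rather than pure data or pure model parallelism, mirroring the observation above that the best strategy depends on the compute and network balance of the underlying system.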
Saptadeep is currently working toward a PhD in the Department of Electrical and Computer Engineering at the University of California, Los Angeles. His research interests include scale-out system architectures and design of waferscale and chiplet-based processor systems.
Newsha is a research scientist at Facebook AI Research (FAIR). Her current work focuses on hardware/software co-design for extremely large-scale deep learning applications. She received her Ph.D. from UW-Madison in 2016.