Training Large-scale Foundation Models on Emerging AI Chips
Tuesday, 8 August, from 2:00pm to 5:00pm in room 202B
Abstract
Foundation models such as ChatGPT and GPT-4 have garnered significant interest from both academia and industry due to their emergent capabilities, such as few-shot prompting, multi-step reasoning, instruction following, and model calibration. Such capabilities were previously only attainable with specially designed models, such as those using knowledge graphs, but can now be achieved on a much larger scale with foundation models. As the capabilities of foundation models have increased, so too have their sizes, at a rate much faster than Moore's law. For example, the BERT large model, released in 2018, had 334M parameters. The Pathways Language Model (PaLM), released in 2022, was trained with 540B parameters, an increase of more than three orders of magnitude in just four years. Training foundation models requires massive computing power. For instance, training a BERT model on a single state-of-the-art GPU machine with multiple A100 chips can take several days, while training GPT-3 models on a large multi-instance GPU cluster can take several months to complete the estimated 3×10^23 FLOPs.
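As a rough illustration of where a figure like 3×10^23 FLOPs comes from, the sketch below uses the common 6·N·D approximation (about 6 FLOPs per parameter per training token); the parameter count, token count, and cluster throughput are illustrative assumptions rather than numbers from the tutorial.

```python
# Back-of-envelope training-compute estimate using the common 6*N*D approximation.
# All numbers below are illustrative assumptions, not measurements from the tutorial.
params = 175e9          # GPT-3-scale parameter count
tokens = 300e9          # approximate number of training tokens
flops = 6 * params * tokens
print(f"estimated training compute: {flops:.2e} FLOPs")   # ~3e23 FLOPs

# Hypothetical cluster: 1,000 accelerators sustaining 100 TFLOP/s each.
cluster_flops_per_s = 1000 * 100e12
seconds = flops / cluster_flops_per_s
print(f"ideal (zero-overhead) training time: {seconds / 86400:.0f} days")
```

In practice, communication overhead, memory limits, and hardware utilization well below peak push real training times considerably higher, which is why the systems techniques covered in this tutorial matter.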
This tutorial provides an overview of the latest progress in supporting foundation model training and inference with new AI chips. It reviews progress on the modeling side, with an emphasis on the transformer architecture, and presents the system architecture supporting training and serving foundation models. This includes programming frameworks such as PyTorch and TensorFlow, graph compilers, 3D parallelism, and accelerators such as the H100 GPU, TPU, and Trainium. Finally, the tutorial presents our experience training foundation models on different systems.
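For readers new to 3D parallelism (the combination of data, tensor, and pipeline parallelism), the toy sketch below illustrates only the tensor-parallel ingredient by splitting a linear layer's weight across two simulated shards on CPU. It is a minimal, self-contained illustration under stated assumptions, not the tutorial's implementation; a real setup would shard across GPUs or accelerators using PyTorch's distributed facilities or a graph compiler.

```python
# Toy illustration of tensor (model) parallelism, one ingredient of 3D parallelism.
# Two "devices" are simulated on CPU with plain tensors.
import torch

torch.manual_seed(0)
batch, d_in, d_out = 4, 8, 6

x = torch.randn(batch, d_in)   # activations, replicated on both shards
W = torch.randn(d_in, d_out)   # full weight of a linear layer

# Column-parallel split: each shard holds half of the output columns.
W_shard0, W_shard1 = W.chunk(2, dim=1)

# Each shard computes its partial output independently (no communication needed here).
y_shard0 = x @ W_shard0
y_shard1 = x @ W_shard1

# Gathering along the feature dimension reconstructs the full output.
y_parallel = torch.cat([y_shard0, y_shard1], dim=1)

assert torch.allclose(y_parallel, x @ W)
print("sharded output matches single-device output:", y_parallel.shape)
```

Combining this kind of weight sharding with data parallelism across batches and pipeline parallelism across layers is what allows models with hundreds of billions of parameters to fit and train efficiently on clusters of accelerators.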
Slides
You can download the slides here.
Panel Discussion
To wrap up the tutorial, we will hold a joint 30-minute panel discussion with the speakers, Professor Yiran Chen from Duke University, and Professor Yizhou Sun from UCLA, on recent advancements in AI hardware, the importance of software-hardware co-design for new AI chips, and the democratization of AI research and training.
Speakers' Bios
Jun (Luke) Huan is a principal scientist at AWS AI Labs. Dr. Huan works on AI and data science. He has published more than 160 peer-reviewed papers in leading conferences and journals and has graduated eleven Ph.D. students. He received the NSF Faculty Early Career Development Award in 2009, and his group has won several best paper awards at leading international conferences. Before joining AWS, he worked at Baidu Research as a distinguished scientist and head of the Baidu Big Data Laboratory. He founded StylingAI Inc., an AI start-up, and served as its CEO and Chief Scientist from 2019 to 2021. Before joining industry, he was the Charles E. and Mary Jane Spahr Professor in the EECS Department at the University of Kansas. From 2015 to 2018, Dr. Huan was a program director at the US NSF in charge of its big data program.
Yida Wang is a principal scientist in the AWS AI team at Amazon. His research interests are in systems, high-performance computing, and big data analytics. He currently works on deep learning systems, with a focus on compiling and optimizing deep learning models for efficient training and inference, especially large-scale foundation models. His mission is to bridge high-level models from various frameworks and low-level hardware platforms, including CPUs, GPUs, and AI accelerators, so that different models can execute with high performance on different devices.
Youngsuk Park is a Senior Applied Scientist at AWS AI Labs on the Trainium-Extension science team, working on training and testing foundation models on accelerators. Prior to that, he worked on R&D for Amazon Forecast as a lead scientist. His research lies in the interplay between machine learning, foundation models, optimization, and decision-making. In particular, Dr. Park is passionate about developing scientific methodologies for time series and their multi-modality with NLP, with principled robustness and uncertainty quantification, leading to productionization with business impact. Before joining AWS, he obtained both his MS and PhD in Electrical Engineering from Stanford University, working on convex optimization for machine learning with Prof. Stephen P. Boyd and on graphical models with Prof. Jure Leskovec.
Aashiq Muhamed works as an Applied Scientist at AWS AI Labs, where he specializes in optimizing deep learning systems, including compilation, distributed training, and inference. He is particularly interested in the intersection of machine learning and distributed systems, as well as the challenges involved in automating deep learning system DevOps. Prior to his current role, Aashiq worked at Amazon Search, where he focused on optimizing large-scale semantic search models. Aashiq's academic background includes graduate studies at Stanford University, where he worked on problems at the intersection of learning, control, and high-dimensional simulation.
Rahul Solanki works as a Senior Machine Learning Apps Engineer in the AWS Neuron team at Amazon. He currently builds frameworks and tools that enable users to efficiently train and run inference with deep learning models on AI accelerators. His work also includes researching and building techniques that enable distributed training of models. Prior to that, he worked at Landing AI as a Machine Learning Engineer, building and deploying computer vision applications. He obtained his Master's degree in ECE from the Georgia Institute of Technology, Atlanta, where his research lay at the intersection of vision and language.
Christian Bock is an Applied Scientist at AWS AI Labs, specializing in optimizing the training and inference of large language models on advanced machine learning accelerators using open standards like XLA. His research interests revolve around the interpretability of language models and their practical applications in healthcare, time series analysis, and topological data analysis. He holds a PhD from ETH Zurich, where his work focused on graph classification, studying the generalization behavior of deep neural networks, and devising algorithms for time series classification and pattern mining.