The goal of this workshop is to introduce the TCS audience (junior graduate students in particular, as well as more senior researchers who are interested) to fundamental algorithmic problems arising in the modern Large Language Model (LLM), or more broadly foundation model, revolution. While one might disagree on how close we are to AGI (or whether we will get there in our lifetimes), LLMs have recently had an outsized impact on technology, and it seems very likely that their importance will not wane anytime soon. More pertinently, adaptations of classical ideas from TCS (e.g., I/O-aware algorithms) have already started influencing recent advances in LLMs, and we believe there is a huge opportunity for TCS to make even more contributions in this space in the (near) future.
The workshop is part of the TheoryFest at STOC'24, and will take place Monday June 24 to Wednesday June 26, 2024.
Monday, June 24
9:00am: Jana Kulkarni, Tutorial 1: Three Key Ideas in the LLM Revolution
9:55am: Break
10:05am: Atri Rudra, Tutorial 2: Abstracting away the Deep Learning Mumbo-Jumbo (with Arithmetic Circuits)
11:00am: Day 1 of workshop ends
Tuesday, June 25
9:00am: Dan Fu, Hardware-Aware Efficient Primitives for Machine Learning
9:55am: Break
10:05am: Simran Arora, Understanding and improving efficient machine learning models
11:00am: Day 2 of workshop ends
Wednesday, June 26
9:00am: Ashish Sabharwal, How Far Can LLMs Go? Computational Limits on the Underlying Neural Architectures
9:55am: Break
10:05am: Zeyuan Allen-Zhu, Physics of Language Models: from Newton, Kepler, Brahe to a Level-2 Reasoning Skill Beyond Humans
11:00am: Workshop ends
We are grateful to Microsoft Research for covering the participation costs of the speakers (with the exception of Zeyuan Allen-Zhu).
Title: Physics of Language Models: from Newton, Kepler, Brahe to a Level-2 Reasoning Skill Beyond Humans
Abstract: Recent advancements in LLMs have showcased their ability to solve math and reasoning problems, achieving near-perfect scores on benchmarks like GSM8K. However, do these models genuinely develop reasoning or algorithmic skills, or do they merely memorize solution templates? To answer this, we believe it is insufficient to compare benchmarks or conduct behavioral studies with models like GPT-4o.
In this talk, we explore how LLMs tackle grade-school math problems by addressing the following questions: (1) What is the model's hidden reasoning process? (2) Do models use reasoning skills similar to or different from humans? (3) Do models develop more general skills beyond those needed for GSM8K problems? (4) How large or deep must a model be to solve GSM8K-level math questions? Our study reveals hidden mechanisms in how LLMs solve mathematical problems, offering insights that extend beyond the current understanding of LLMs.
More generally, our research does not stop at math reasoning skills but extends to other aspects of intelligence. We will highlight several examples and discuss how LLMs secretly learn algorithms, ranging from topological sort and graph connectivity to dynamic programming.
Title: Understanding and improving efficient machine learning models
Abstract: A key bottleneck in machine learning is compute: ML is succeeding at modeling text, code, and even DNA, but to reach its full potential, we need more resources than we currently have. The Transformer architecture, which powers industry models, requires compute and memory to grow with the input size, precluding dreams of modeling inputs containing millions of lines of code or the 3.2 billion nucleotide pairs in genome sequences. Candidate efficient language models (LMs), which seek to reduce the compute and memory requirements relative to Transformers, are emerging at a rapid pace. However, deviating from the Transformer orthodoxy is a major risk: models are served to millions of users and take billions of dollars to train, and we have limited insight into how alternatives will impact quality.
In this talk, I first discuss our research on developing our understanding of why architectures work well, in order to distill them into their most efficient forms. Despite the breadth of skills required in language modeling (syntax, fact memorization, reasoning) and the breadth of efficient LMs being proposed, our ICLR 2024 work found that a single skill called associative recall (AR) surprisingly explains 80%+ of the language modeling quality difference between Transformers and a popular class of efficient LMs built from Hadamard product and convolution operations. An LM performs AR if it recalls information that it has seen earlier in the input. We will use AR to theoretically and empirically explain the tradeoffs across several classes of efficient LM architectures. I will then share how in our ICML 2024 work (Spotlight, top 3.5% of 10K papers), we exploited this understanding to develop new hardware-efficient ML architectures (BASED and JRT), which expanded the Pareto frontier of the quality-efficiency tradeoff space beyond prior LMs.
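To make the associative recall skill concrete, here is a minimal sketch of an AR task in plain Python (the token format and parameters are illustrative assumptions, not the exact synthetic setup from the cited papers): the input lists key-value pairs and then repeats one of the keys, and the model must emit the value that was paired with that key earlier in the sequence.

```python
# Minimal sketch of a synthetic associative-recall (AR) task.
# Toy vocabulary and final-query format are assumptions for illustration only.
import random

def make_ar_example(num_pairs=8, num_keys=26, num_values=10, seed=None):
    """Return (sequence, answer): key-value pairs followed by a query key."""
    rng = random.Random(seed)
    keys = rng.sample(range(num_keys), num_pairs)        # distinct keys
    values = [rng.randrange(num_values) for _ in keys]   # arbitrary values
    sequence = []
    for k, v in zip(keys, values):
        sequence += [f"k{k}", f"v{v}"]
    query_key, answer = rng.choice(list(zip(keys, values)))
    sequence.append(f"k{query_key}")                     # repeat an earlier key
    return sequence, f"v{answer}"

def recall_oracle(sequence):
    """Exact lookup of the queried key: the behavior an AR-capable LM must match."""
    pairs = {sequence[i]: sequence[i + 1] for i in range(0, len(sequence) - 1, 2)}
    return pairs[sequence[-1]]

if __name__ == "__main__":
    seq, ans = make_ar_example(seed=0)
    print(" ".join(seq), "->", recall_oracle(seq), "(expected", ans + ")")
```

The dictionary lookup plays the role of an oracle: an LM "performs AR" to the extent that it reproduces this lookup on held-out sequences.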
Title: Hardware-Aware Efficient Primitives for Machine Learning
Abstract: Efficiency is increasingly tied to quality in machine learning, with more efficient training algorithms leading to more powerful models. However, today's most popular machine learning models are built on asymptotically inefficient primitives. For example, attention in Transformers scales quadratically in the input size, while MLPs scale quadratically in model dimension. In this talk, I discuss my work on improving the efficiency of the core primitives in machine learning, with an emphasis on hardware-aware algorithms and long-context applications. First, I focus on replacing attention with gated state space models (SSMs) and convolutions, which scale sub-quadratically in context length. I describe the H3 (Hungry Hungry Hippos) architecture, a gated SSM architecture that matches Transformers in quality up to 3B parameters and achieves 2.4x faster inference. Second, I focus on developing hardware-aware algorithms for SSMs and convolutions. I describe FlashFFTConv, a fast algorithm for computing SSMs and convolutions on GPUs by optimizing the Fast Fourier Transform (FFT). FlashFFTConv yields up to 7x speedup and 5x memory savings, even over vendor solutions from Nvidia. Third, I will briefly touch on how these same techniques can also be used to develop sub-quadratic scaling in the model dimension. I will describe Monarch Mixer, which uses a generalization of the FFT to achieve sub-quadratic scaling in both sequence length and model dimension. Throughout the talk, I will give examples of how these ideas are beginning to take hold, with gated SSMs and their variants now leading to state-of-the-art performance in long-context language models, embedding models, and DNA foundation models.
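For intuition on the FFT-based convolutions mentioned above, here is a minimal NumPy sketch of the textbook O(N log N) algorithm that FlashFFTConv optimizes for GPUs. This is not the FlashFFTConv kernel itself, only the asymptotic idea it builds on.

```python
# Minimal sketch of FFT-based long convolution, the O(N log N) primitive
# underlying SSM/convolution layers; not the FlashFFTConv GPU kernel itself.
import numpy as np

def direct_conv(u, k):
    """Causal convolution by the definition: O(N^2) for sequence length N."""
    N = len(u)
    y = np.zeros(N)
    for i in range(N):
        for j in range(i + 1):
            y[i] += k[j] * u[i - j]
    return y

def fft_conv(u, k):
    """The same causal convolution via the FFT: O(N log N)."""
    N = len(u)
    U = np.fft.rfft(u, n=2 * N)   # zero-pad to avoid circular wrap-around
    K = np.fft.rfft(k, n=2 * N)
    return np.fft.irfft(U * K, n=2 * N)[:N]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u, k = rng.standard_normal(1024), rng.standard_normal(1024)
    assert np.allclose(direct_conv(u, k), fft_conv(u, k), atol=1e-8)
    print("FFT convolution matches the direct O(N^2) computation.")
```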
Title: Three Key Ideas in the LLM Revolution
Abstract: In this tutorial, we will discuss three groundbreaking concepts that have paved the way for the development of advanced models like GPT-4, Gemini, and Claude. Additionally, we will cover fundamental concepts related to large language models (LLMs).
Title: Abstracting away the Deep Learning Mumbo-Jumbo (with Arithmetic Circuits)
Abstract: This tutorial will continue where Jana's tutorial left off, with the goal of abstracting out the problems and models for language modeling. After laying out the basic setup, the tutorial will focus on the Transformer model: we will introduce a simplified abstraction of the Transformer and show that it is sufficient to solve an abstract language modeling task, Associative Recall (also known in other circles as the key-value store problem). We will then zero in on the quadratic complexity of the Transformer model as a key bottleneck. Next, we will turn to the computational constraints introduced by modern hardware (like GPUs and TPUs), which are used to run modern deep learning pipelines. Finally, we will take a quick walk through some beloved TCS ideas that do not work in this setup and some other ideas that do. Specifically, we will focus on some of the ideas that will be expanded on in the talks on Tuesday and Wednesday. A key technical simplification will be our insistence on using arithmetic circuits to make sense of the deep learning world.
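As a warm-up for the quadratic bottleneck the tutorial zeroes in on, here is a toy single-head attention in NumPy (the dimensions and causal masking are illustrative assumptions, not the tutorial's notation): the N x N score matrix is exactly the object whose cost grows quadratically in the sequence length N.

```python
# Toy single-head attention: the (N, N) score matrix is the quadratic bottleneck.
# Illustrative sketch only; not any specific library's API or the tutorial's model.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K, V: (N, d) arrays. Time and memory are Theta(N^2) in sequence length N."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])           # (N, N): the quadratic object
    N = scores.shape[0]
    causal_mask = np.triu(np.ones((N, N), dtype=bool), k=1)
    scores = np.where(causal_mask, -np.inf, scores)   # each position attends only to the past
    return softmax(scores, axis=-1) @ V               # (N, d)

if __name__ == "__main__":
    N, d = 512, 64
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
    print(attention(Q, K, V).shape)                   # (512, 64), via an N x N intermediate
```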
Title: How Far Can LLMs Go? Computational Limits on the Underlying Neural Architectures
Abstract: While bigger and better LLMs have continually surprised everyone with their amazing achievements, do they truly possess the potential to one day solve all our challenging computational problems? I'll discuss a series of results formally showing that, even setting learnability aside, the highly parallel architecture of the neural networks beneath today's LLMs (transformers and sometimes state space models, SSMs) fundamentally limits their potential. Drawing connections to classical TCS results across various models of computation (circuits, logic, and Turing machines), we will see how transformer encoders and large classes of SSMs are limited to a very small complexity class, uniform TC^0. This implies that even simple computational problems such as graph connectivity and state tracking (think narratives, code execution, chess in certain notation, etc.) are provably beyond their reach, under standard theoretical assumptions. We will see that this hard limitation can be overcome with intermediate generation in transformers, helping explain the empirically observed success of "chain of thought" prompting, and with input-dependent transition matrices in SSMs. Both of these features, however, come at the cost of imposing sequentiality on the architecture, pointing towards a "parallelism tradeoff": while the high parallelism of current architectures has enabled training on trillions of tokens, it has also placed fundamental limits on the expressivity of LLMs.
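To illustrate the kind of state-tracking problem mentioned above, here is a toy Python sketch (an illustrative example, not taken from the talk) that composes a word of permutations of five elements from left to right; maintaining the running state step by step is naturally sequential, which is the tension with highly parallel architectures that the talk formalizes.

```python
# Toy "state tracking" task: fold a sequence of permutations into one running state.
# Illustrative example only; parameters and format are assumptions, not the talk's.
import random

def compose(p, q):
    """Apply permutation p, then q (both given as tuples mapping index -> image)."""
    return tuple(q[p[i]] for i in range(len(p)))

def track_state(perms):
    """Reduce a word of permutations to the single permutation it denotes."""
    state = tuple(range(5))            # identity permutation on {0, ..., 4}
    for p in perms:
        state = compose(state, p)      # each step depends on the previous state
    return state

if __name__ == "__main__":
    rng = random.Random(0)
    word = [tuple(rng.sample(range(5), 5)) for _ in range(1000)]
    print(track_state(word))
```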