Summary of Allen-Zhu’s ICML 2024 Tutorial on Physics of Large Language Models (LLMs)

Summer 2024


This article summarizes the tutorial "Physics of Large Language Models (LLMs)" presented by Allen-Zhu at ICML in July 2024. It extracts key points and conclusions from his presentation and integrates them with my prior understanding of LLMs. The tutorial was divided into three parts: the first discussed LLMs' capabilities in knowledge extraction, manipulation, and compression; the second covered LLMs' reasoning and planning abilities; and the third focused on LLMs' ability to learn structured language. Since the third part, which bears on LLMs' planning capabilities, is largely an extension of and complement to the second, this summary merges the tutorial's original second and third parts. The tutorial reflects Allen-Zhu and his team's detailed observations and systematic thinking on the underlying working mechanisms of LLMs. It includes some surprising and broadly applicable observations, practical suggestions for putting LLMs into production, and far-reaching insights, and it has garnered widespread attention.

 

I. Why Are Insights from this Tutorial Important in Practice?

 

Industry has invested heavily in training and deploying large models. For most companies, algorithm development may be the least resource-consuming aspect; they typically build on well-proven, effective algorithms and make only incremental modifications. The most resource-consuming aspects are twofold: the collection and cleaning of high-quality data, and the evaluation of trained LLMs. Currently, LLMs are usually pre-trained on datasets containing approximately 10 to 50 trillion tokens. However, the amount of text data companies can collect far exceeds this quantity; much of it is excluded from pre-training datasets because of its low quality. So what constitutes high-quality pre-training data? The answer depends on the application scenario. For instance, if a company is training an LLM to correct grammatical errors, most fragmented social media content would not be considered high-quality data (since it often contains grammatical errors). Conversely, if a company is training an LLM to automatically generate social media posts, social media data is essential. Despite the varying definitions of high-quality data across application scenarios, there are some common expectations, such as providing sufficient textual diversity. But what exactly is diversity? How can it be quantified? What type of diversity makes LLM training more efficient? These questions remain unanswered, but Allen-Zhu's research attempts to address them and offers practical guidelines.

 

Another challenging issue the industry faces at the data level is that, after internet data has been exhausted, acquiring and cleaning additional data becomes extremely costly. One direct approach is to use trained models to generate new data for subsequent training, but studies have shown that model-generated data does not improve model performance as much as human-generated text does [1]. The problem of how to obtain new training data cheaply therefore remains unresolved. Allen-Zhu suggested in his presentation that one possible approach is synthetic data: data generated neither by LLMs nor by humans, but simulated according to specific rules. For example, he demonstrated how to generate a large number of complex elementary school math problems using Directed Acyclic Graphs (DAGs), which depict the dependencies between the variables in the problems. Although this example does not completely solve the issue above, it offers a valuable direction for practice.


Apart from data-related issues, the biggest challenge in deploying LLM applications is how to evaluate the performance of a trained LLM and how to establish evaluation standards. Because of the inherent ambiguity of natural language, it is often difficult to establish clear, quantifiable evaluation criteria in many application scenarios, and directly collecting feedback from users is usually even more expensive. A simple and direct evaluation method is to use one LLM to evaluate another. However, an unavoidable problem with this method is that each LLM has its own limitations (for example, every LLM hallucinates), and these limitations can bias the evaluation. This is a tricky problem with no standard answer in industry or academia (largely because every application scenario differs, so the evaluation criteria differ as well). There is, however, some consensus: for example, a model that masters or compresses more knowledge does not always perform better than a model with less knowledge, so we cannot evaluate a model solely by how much knowledge it possesses. This is intuitively understandable; the same holds for humans. A specialist may know far less than a generalist, yet in their field of expertise the specialist usually performs better. Allen-Zhu's research on LLM evaluation suggests that the datasets used for evaluation can also be synthetic (such as the simulated elementary school math problems mentioned above). Evaluating LLMs on such data has several advantages: first, fixed rules and logic can produce these evaluation sets in large quantities, avoiding issues such as data contamination (current public benchmark sets are fixed and limited and may end up in LLM training sets, which amounts to cheating); second, by adjusting the rules and logic used to generate the synthetic evaluation sets, companies can directly control the difficulty of the evaluation sets and the types of abilities being evaluated.


 

II. First Part: Knowledge Extraction, Manipulation, and Compression Capabilities of LLMs

 

Before discussing Allen-Zhu's perspectives and arguments, we first introduce the methodology he used to derive these insights. He advocates a controlled-experiment approach for deriving general conclusions about LLMs. Controlled experiments involve precisely controlling the training data and algorithms, observing the outcomes after changing some part of the data or algorithms, and then drawing conclusions from these observations. In the first part of the tutorial, he mentioned that he used synthetic, fictional biographies as training input for LLMs. Because the data is fictional, he could precisely control the amount of information and the composition of the text according to research needs. This approach differs from previous studies on fully trained LLMs, where it was difficult to attribute observed behaviors or problems to data versus algorithmic issues. Controlled experiments allow the effects of data and algorithms on the LLM's final behavior to be separated more cleanly.

 

Knowledge Extraction

Knowledge extraction is a crucial capability of LLMs. But how do we tell whether an LLM has mastered knowledge extraction? In particular, it is often challenging to distinguish whether a model has simply memorized information or has genuinely learned the ability to extract knowledge. To address this, Allen-Zhu designed an experiment using mixed training: the model was trained on the biographies together with question-answer pairs (such as "When is A's birthday?") for half of the individuals, and was then asked questions of the same kind about the other half of the individuals, whose biographies appeared in training without any accompanying questions. If the LLM has truly learned how to extract knowledge from language structure, it should be able to answer accurately. His experiment found that the LLM almost always answered accurately, which is strong evidence that the LLM had mastered knowledge extraction.
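To make the mixed-training setup concrete, here is a minimal sketch of what such a training corpus might look like. The individuals, attribute names, and text templates are hypothetical and only illustrate the structure (biographies for everyone, question-answer pairs for only half of them); they are not Allen-Zhu's actual dataset.

```python
import random

# Hypothetical fictional individuals with attributes (names and facts are made up).
people = [
    {"name": "Anya Forger", "birthday": "July 28", "employer": "Microsoft"},
    {"name": "Ben Carter", "birthday": "March 3", "employer": "Acme Corp"},
    # ... more synthetic individuals
]

def biography(p):
    # Biography sentence containing the facts.
    return f"{p['name']} was born on {p['birthday']} and later worked at {p['employer']}."

def qa_pairs(p):
    # Question-answer pairs that "extract" the same facts.
    return [
        (f"When is {p['name']}'s birthday?", p["birthday"]),
        (f"Where does {p['name']} work?", p["employer"]),
    ]

random.shuffle(people)
half = len(people) // 2
train_half, test_half = people[:half], people[half:]

# Mixed training: biographies of *all* individuals, QA pairs only for the first half.
training_corpus = [biography(p) for p in people]
training_corpus += [f"Q: {q} A: {a}" for p in train_half for (q, a) in qa_pairs(p)]

# Evaluation: QA pairs for the held-out half, whose biographies were seen
# in training but whose questions were not.
eval_set = [(q, a) for p in test_half for (q, a) in qa_pairs(p)]
```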

 

However, mixed training is not a common way to train LLMs. In practice, both academia and industry typically use pretraining followed by finetuning, so he further tested the LLM's knowledge extraction ability under this common regime. He pretrained the model on the biographies, finetuned it with questions about half of the individuals, and then tested it on questions about the other half. Surprisingly, the LLM was unable to answer accurately (test accuracy was near zero)! After experimenting with different LLMs and finetuning methods, he found that this phenomenon held in general. This observation appears to contradict reality: models like GPT-4, trained with pretraining and finetuning, clearly possess strong knowledge extraction abilities. Why does the experimental result contradict this fact? The answer lies in the data. In the experimental setting, each piece of knowledge (e.g., "A's birthday is July 28") appeared only once, whereas in GPT-4's training data a piece of knowledge may appear dozens of times in different forms (e.g., different languages, writing styles, and word orders). Allen-Zhu calls this phenomenon, in which the same piece of knowledge appears in different forms in the training set, knowledge augmentation. He found that when knowledge augmentation is present, the LLM fundamentally changes the way it associates information internally, which is what enables it to acquire knowledge extraction capabilities. (Allen-Zhu used probing to observe how this change occurs; the reader is referred to the Appendix for an explanation of this process.) Moreover, he noted that it is unnecessary to augment the knowledge of all individuals in the biography data; if knowledge augmentation is applied to some individuals, the model generalizes the extraction ability to the others. This is intuitive: knowledge extraction is a generalizable skill, and once an LLM masters it, it can easily apply it to different individuals.
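A minimal illustration of what knowledge augmentation can look like in a corpus follows. The fact, the name, and the templates are invented for this sketch; they only show the key idea of the same fact appearing under several surface forms.

```python
# Hypothetical example: one fact rendered in multiple surface forms.
fact = {"name": "Anya Forger", "birthday": "July 28", "employer": "Microsoft"}

templates = [
    "{name} was born on {birthday} and later worked at {employer}.",
    "Born on {birthday}, {name} eventually joined {employer}.",
    "{employer} employee {name} celebrates a birthday on {birthday}.",
    "On {birthday}, {name} was born; years later, {name} took a job at {employer}.",
]

# Knowledge augmentation: the same underlying knowledge appears several times,
# each time with a different writing style and word order.
augmented_corpus = [t.format(**fact) for t in templates]
for sentence in augmented_corpus:
    print(sentence)
```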

 

Knowledge Manipulation

In this part of the tutorial, the most intriguing experimental conclusion concerns knowledge inverse search. To explain inverse search, consider this example: the training dataset contains knowledge such as "A's birthday is July 28"; a normal knowledge search asks "When is A's birthday?", while an inverse search asks "Who was born on July 28?". The experiment found that LLMs had near-zero accuracy on test sets for such inverse searches, hence the term "reversal curse." The simplest way to mitigate this problem is to reverse the word order of the relevant parts of the corpus in the training set [2].
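The reversal procedure in [2] is more refined than this, but the following toy snippet (with made-up sentences) conveys the basic idea of also exposing the model to the relevant facts in reversed word order.

```python
# Toy illustration of adding reversed word-order copies to counter the reversal curse.
facts = [
    "Anya Forger's birthday is July 28.",
    "Anya Forger works at Microsoft.",
]

def reverse_words(sentence):
    words = sentence.rstrip(".").split()
    return " ".join(reversed(words)) + "."

# Keep the original sentences and add reversed copies, so the model also sees
# the value ("July 28") appearing before the entity ("Anya Forger").
augmented_corpus = facts + [reverse_words(s) for s in facts]
for line in augmented_corpus:
    print(line)
```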

 

Knowledge Compression: Knowledge Capacity Scaling Law

As mentioned earlier, one benefit of controlled experiments is the ability to precisely control the amount of knowledge in the data. This makes it possible to examine how much knowledge a model can learn as a function of its parameter count. Here, knowledge is quantified in bits: a person's gender, with two possibilities, corresponds to 1 bit of knowledge (log₂ 2), and a person's birth month, with 12 possibilities, corresponds to log₂ 12 bits. The experiments found that current large models follow the rule that 1 parameter can compress 2 bits of knowledge; Allen-Zhu calls this the Knowledge Capacity Scaling Law. As the number of parameters grows, knowledge capacity grows linearly at this rate. The Knowledge Capacity Scaling Law lets us estimate how many parameters a model needs in order to contain most human knowledge. Allen-Zhu's answer is that a 7-billion-parameter model can, in principle, compress all the human knowledge contained in English Wikipedia and textbooks. However, almost all current large models exceed 7 billion parameters (e.g., the LLAMA2 family includes a 70-billion-parameter version), which suggests that in practice these models fall short of the ideal: their effective capacity is below 2 bits per parameter. Allen-Zhu therefore predicts that in the coming years more methods will be discovered that let LLMs compress more knowledge into fewer parameters, eventually reaching the 2-bit-per-parameter Knowledge Capacity Scaling Law.
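As a rough back-of-the-envelope illustration of this bit counting and the 2-bit-per-parameter law (the chosen attributes and the 31-day assumption are mine, not the tutorial's exact accounting):

```python
import math

# Bits needed to store a few attributes of one person.
bits_gender = math.log2(2)        # 1 bit
bits_birth_month = math.log2(12)  # ~3.58 bits
bits_birth_day = math.log2(31)    # ~4.95 bits (assuming a 31-day upper bound)
bits_per_person = bits_gender + bits_birth_month + bits_birth_day

# Capacity of a model under the 2-bits-per-parameter law.
params = 7e9
capacity_bits = 2 * params        # 14e9 bits, about 1.75 GB of pure knowledge

people_storable = capacity_bits / bits_per_person
print(f"{bits_per_person:.2f} bits per person, "
      f"about {people_storable:.2e} such people storable in a 7B-parameter model")
```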

 

How can the efficiency of LLM knowledge compression be improved? Allen-Zhu suggests knowledge augmentation as one possible way. When knowledge is repeated in different forms enough times in the dataset, the compression efficiency of LLMs improves significantly. When the repetition rate is sufficiently high, the compression abilities of different LLMs do not differ much; they all follow the Knowledge Capacity Scaling Law. When the repetition rate is insufficient, however, some LLMs (e.g., LLAMA) suffer a larger drop in knowledge capacity. Through controlled experiments, Allen-Zhu traced this to LLAMA's use of a gated MLP: when the gated MLP is replaced with a vanilla MLP, LLAMA's knowledge compression ability improves.
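For readers unfamiliar with the architectural difference, here is a minimal PyTorch sketch contrasting a vanilla feed-forward block with a gated (SwiGLU-style) block of the kind used in LLaMA-like models; the dimensions and details are simplified and do not match any particular model.

```python
import torch
import torch.nn as nn

class VanillaMLP(nn.Module):
    """Standard transformer feed-forward block: up-projection, nonlinearity, down-projection."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))

class GatedMLP(nn.Module):
    """Gated feed-forward block (SwiGLU-style): the activation of one
    projection multiplicatively gates another projection."""
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.gate = nn.Linear(d_model, d_hidden)
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.down(self.act(self.gate(x)) * self.up(x))

x = torch.randn(1, 8, 64)
print(VanillaMLP(64, 256)(x).shape, GatedMLP(64, 256)(x).shape)
```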

 

At the end of this part of the tutorial, Allen-Zhu mentioned an experimental observation: when the quality of the training dataset is low, the model's compression performance also declines. Intuitively this is easy to understand: if you try to find and remember useful knowledge buried in a large amount of irrelevant information, your efficiency decreases as the amount of irrelevant information grows. This is why the industry spends so much time on data cleaning to obtain high-quality data. At the same time, based on his experimental results, Allen-Zhu suggested that for internet data, simply attaching the source domain name to each document in the training set lets the LLM learn on its own, during training, which domains tend to contain higher-quality data, greatly improving knowledge compression efficiency.
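An illustration of the domain-name trick follows; the tag format and placement are assumptions on my part, since the tutorial only states that the source domain is attached to each document.

```python
# Toy illustration: tag each training document with its source domain so the
# model can learn on its own which domains tend to carry higher-quality text.
documents = [
    ("wikipedia.org", "The mitochondrion is the powerhouse of the cell."),
    ("randomforum.example.com", "lol idk my cat walked on the keyboard"),
]

tagged = [f"<domain>{domain}</domain> {text}" for domain, text in documents]
for doc in tagged:
    print(doc)
```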

 

III. Second and Third Parts: LLMs' Reasoning and Planning Abilities

 

In this part of the tutorial, Allen-Zhu explored LLMs' reasoning and planning abilities through experimental observations, revealing the foundations of these capabilities. To examine reasoning, his research team created synthetic data: a dataset of automatically generated elementary school math problems built from Directed Acyclic Graphs (DAGs). For example, consider a problem like, "An apple costs 10 dollars, a peach costs as much as two apples, and two bananas cost as much as three peaches. How much does a banana cost?" The relationships between the variables in this problem can be summarized in a DAG: "Apple → Peach → Banana." This DAG indicates that to compute the price of a banana, you first need the price of a peach, and computing the price of a peach in turn depends on the price of an apple. Of course, more complex math problems involve more variables, and some variables may not even be necessary for solving the problem.
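Here is a toy sketch of generating such a word problem from a dependency DAG and solving it by evaluating variables in topological order. It is only an illustration of the idea, not the research team's actual problem generator.

```python
# Each entry: variable -> (list of parent variables, function computing its value).
dag = {
    "apple":  ([], lambda vals: 10),                            # an apple costs 10 dollars
    "peach":  (["apple"], lambda vals: 2 * vals["apple"]),      # a peach costs as much as two apples
    "banana": (["peach"], lambda vals: 3 * vals["peach"] / 2),  # two bananas cost as much as three peaches
}

def topological_order(dag):
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        for parent in dag[node][0]:
            visit(parent)
        seen.add(node)
        order.append(node)
    for node in dag:
        visit(node)
    return order

# Evaluate every variable once its parents are known.
values = {}
for node in topological_order(dag):
    parents, formula = dag[node]
    values[node] = formula(values)

problem = ("An apple costs 10 dollars, a peach costs as much as two apples, "
           "and two bananas cost as much as three peaches. How much does a banana cost?")
print(problem)
print("Answer:", values["banana"])  # 30.0
```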

 

In the example above we constructed a DAG from a math problem; conversely, many math problems can be constructed from DAGs. After generating these math problems, Allen-Zhu's research team described them in natural language and used them to train LLMs. They then explored two interesting questions:

 

1. Before the model begins solving the problem, does it already know which variables are necessary to solve it?

2. During the model's problem-solving process (after the model starts to answer the problem), does it know which variables need to be calculated first and which need to be calculated later?

 

Through probing, they found that even before it starts answering the problem, the model already knows which variables are necessary. Additionally, during problem-solving, the model understands the dependencies between variables and knows which variables must be calculated first. Further probing revealed an even more surprising finding: before it begins outputting the answer, the model has already "mentally" worked out all the dependencies between variables, including those not needed to solve the problem. This indicates that the model's reasoning ability allows it to lay out the dependencies between variables immediately after seeing the problem. This differs from how humans typically solve math problems: when people realize that a variable is unnecessary, they generally do not bother working out its dependencies on the other variables.

 

Another important aspect of intelligence is planning. So, do LLMs have planning capabilities? Allen-Zhu designed another experiment to provide evidence on this question. For this task, he created a context-free grammar (CFG) dataset. Simply put, in this dataset it is difficult for an LLM to determine the role of a token from local context (e.g., neighboring words); the LLM almost always needs the global context (i.e., the entire text) to interpret a token. Understanding such text requires a dynamic-programming-like capability. The experimental results showed that LLMs successfully learned the grammar underlying this dataset, indicating that they possess dynamic-programming-like capabilities. Since dynamic programming is typically considered a basic planning capability, this experiment provides evidence that LLMs have fundamental planning abilities. In further analysis, Allen-Zhu and his team found that the LLM's attention matrix approximates the state transitions of the underlying dynamic program, recording the transition relationships between different tokens in the sequence.
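To give a sense of the kind of dynamic programming involved, here is a minimal CYK recognizer for a toy CFG in Chomsky normal form. The grammar and test strings are invented; the grammars used in the tutorial are far larger and deeper, but deciding membership in a CFG requires exactly this style of span-based dynamic programming.

```python
# Minimal CYK parser for a tiny context-free grammar in Chomsky normal form.
grammar = {
    "S": [("A", "B")],
    "A": [("A", "A"), ("a",)],   # A -> AA | a
    "B": [("B", "B"), ("b",)],   # B -> BB | b
}

def cyk(tokens, grammar, start="S"):
    n = len(tokens)
    # table[i][j] = set of nonterminals that derive tokens[i:j+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for lhs, rules in grammar.items():
            if (tok,) in rules:
                table[i][i].add(lhs)
    # Fill longer spans from shorter ones (the dynamic-programming recursion).
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for lhs, rules in grammar.items():
                    for rule in rules:
                        if len(rule) == 2 and rule[0] in table[i][k] and rule[1] in table[k + 1][j]:
                            table[i][j].add(lhs)
    return start in table[0][n - 1]

print(cyk(list("aabb"), grammar))  # True:  derivable as S -> AB, A -> AA, B -> BB
print(cyk(list("abab"), grammar))  # False: not derivable from S
```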

 

IV. Postscript

 

Some of the observations in this tutorial are surprising, but I cautiously remind readers that some findings and experimental observations mentioned above have not yet been published and thus have not undergone peer review. However, the research paradigm of designing synthetic data combined with probing LLMs' inner workings demonstrates promising potential: this paradigm allows researchers to study smaller LLMs and derive generalizable principles and findings about LLMs.

 

Appendix: Understanding the Internal Information Relationships within LLMs through Probing

An LLM is an autoregressive model. At each step it outputs a hidden variable that summarizes the information from all previous steps (hence it is sometimes called the context vector). Probing examines whether this hidden variable contains specific information. For example, if we feed the sentence "Anya was born on July 28 and later worked at Microsoft" into a trained LLM, the LLM outputs a corresponding hidden variable at each word. Suppose we truncate the sentence at "…later worked at". We obtain a hidden variable summarizing the information up to the phrase "later worked at", and this variable can be used to predict the next word in the sentence. The hidden variable at the position "later worked at" can therefore predict which company Anya worked at; if we use it to make a classification prediction (the company is a discrete variable), the prediction would be Microsoft. Now, can the hidden variable at the position "Anya" also predict Anya's company?
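As a minimal sketch of how such a probe is trained in practice: below, random placeholder features stand in for the real hidden states at the position of the person's name, so the numbers produced mean nothing by themselves; in an actual probing experiment the features would be extracted from the trained LLM.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: one feature vector per person, one company label per person.
rng = np.random.default_rng(0)
n_people, hidden_dim, n_companies = 500, 64, 5

hidden_states = rng.normal(size=(n_people, hidden_dim))   # placeholder for real hidden states
companies = rng.integers(0, n_companies, size=n_people)   # placeholder company labels

# Train a linear classifier (the "probe") on half the people, evaluate on the rest.
probe = LogisticRegression(max_iter=1000).fit(hidden_states[:250], companies[:250])
accuracy = probe.score(hidden_states[250:], companies[250:])

# With random features the accuracy stays near chance (about 1/n_companies); if the
# hidden state at "Anya" truly encoded the employer, accuracy would be far above chance.
print(f"probe accuracy: {accuracy:.2f}")
```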

Allen-Zhu found that with the mixed training approach, the hidden variable at the position "Anya" could effectively predict the company's name. With the pretrain+finetune regime, however, the hidden variable at the position "Anya" could no longer predict the company name! This explains why, under pretrain+finetune, the model loses its knowledge extraction capability: it fails to associate the company information with "Anya", so when faced with the question "Where does Anya work?", it cannot answer accurately. However, if knowledge augmentation is applied to the training set, then even under pretrain+finetune the LLM regains its knowledge extraction capability. This is because when the same piece of knowledge appears in different forms (e.g., different languages, different word orders) in the training set, the way the model internally associates information fundamentally changes: the hidden variable at "Anya" goes from being unable to encode the company information to encoding it.