In early August 2025, after a long buildup and widespread anticipation, OpenAI finally unveiled GPT-5, its fifth-generation foundation language model. Unlike the debut of GPT-4 more than two years earlier, when OpenAI left the competition far behind, today's large language model landscape is crowded with formidable rivals: Google DeepMind's Gemini, Anthropic's Claude, xAI's Grok, and China's Tongyi Qianwen and DeepSeek all pose serious challenges to the GPT series. Countless AI watchers were therefore eager to see whether OpenAI could once again set off a phenomenon-level wave of technological excitement.
GPT-5 did in fact push the capabilities of large language models to a new frontier. In the extensive testing that followed its release, both benchmark scores and real-world applications showed a significant performance gain over its predecessors. One well-known example is the challenging Pokémon Red task, in which the new model collected all eight badges in just 6,000 steps, whereas o3, the previous generation's powerful reasoning model, needed 16,700.

Despite these advances, GPT-5 still drew a fair amount of mockery in its early days, with many arguing that the improvement fell short of the public's explosive expectations. Part of the blame lies with several basic mistakes OpenAI made during the launch event, but a deeper factor deserves more discussion: is the large-language-model technical path itself approaching a plateau? More provocatively, as some AI skeptics have long argued, is this wave of AI breakthroughs driven by large models about to enter a lull?
Before going further, one must acknowledge that the view that foundation-model progress is slowing is not baseless. The large-model field rests on a crucial empirical regularity known as the Scaling Law: holding the model architecture fixed, a model's capability grows as a power law of the resources invested, such as parameter count, dataset size, and training compute. This implies that once a model reaches a fairly high level of capability, investing the same amount of resources again yields ever smaller performance gains, i.e., diminishing marginal returns. After several years of rapid progress, then, it is close to an industry-wide consensus that without a major architectural innovation, the rate of capability growth will slow.
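To make the diminishing-returns intuition concrete, here is a minimal numerical sketch. The power-law form L(C) = a · C^(−α) matches the qualitative claim above, but the constants `a` and `alpha` are made-up placeholders for illustration, not values fitted to any real model.

```python
# Illustrative sketch of the Scaling Law's diminishing returns.
# The constant and exponent below are arbitrary placeholders.

def loss(compute: float, a: float = 10.0, alpha: float = 0.05) -> float:
    """Power-law loss: L(C) = a * C^(-alpha); lower loss = higher capability."""
    return a * compute ** -alpha

# Each budget is 10x the previous one.
budgets = [10 ** k for k in range(1, 6)]
losses = [loss(c) for c in budgets]

# The absolute gain from each additional 10x of compute keeps shrinking:
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
assert all(gains[i] > gains[i + 1] for i in range(len(gains) - 1))
```

Because the loss is a power law, each successive tenfold increase in compute buys a strictly smaller absolute improvement, which is exactly the "same investment, smaller payoff" pattern described above.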
From another perspective, however, the statement "the public feels that large-model development has entered a plateau" covers more than the technology itself; it refers to the public's perception of capability growth rather than to the objective rate of improvement. Beyond the Scaling Law, then, we also need to analyze the question from a social-psychological angle.
Large language models truly broke out of the tech bubble and into the public spotlight at the end of 2022. The shock was unquestionably phenomenon-level: it was the first time most people had seen a chatbot that could hold an everyday conversation. Soon after, in early 2023, the release of GPT-4 caused another sensation, because its capabilities improved so dramatically over GPT-3.5 that many AI researchers began discussing whether "sparks of artificial general intelligence" had appeared in it. The next time AI stirred public discourse on that scale was arguably the sudden rise of DeepSeek at the end of 2024, but that excitement had less to do with a breakthrough in model capability itself than with a mix of geopolitical rivalry, national sentiment, and other complicating factors.
In terms of raw capability, have large models really made no significant progress since GPT-4? Obviously not. Take just mathematics and programming. GPT-4's math was roughly at the level of a high-school student or early undergraduate, and in programming it could only write small-scale code with reasonably high accuracy. Two years on, and especially after reasoning models emerged in the second half of 2024, today's frontier models have unquestionably surpassed the average master's student at a top university in mathematics; internal experimental models at OpenAI and Google have even reached gold-medal level at the 2025 IMO, that is, human-genius-level problem solving. In programming, paired with a code execution environment and an agentic framework, current top-tier models can reliably generate thousands of lines of error-free code. This rate of improvement is arguably faster than that of the strongest individual humans. And yet these enormous advances never recreated the shockwave of ChatGPT's debut.
At least two reasons explain this.
First, in terms of subjective experience, ChatGPT's arrival was an unprecedented shock for many people: the first time they had personally witnessed, in everyday life, an AI system that could converse naturally, answer almost any question, and even write and program. It was a historic "AI has suddenly arrived" moment that shattered the widespread belief that artificial intelligence was still far away. Shocks this dramatic and disruptive tend to occur only once in the history of a technology; however much AI improves afterward, the psychological impact is hard to match.
Second, as noted above, the capability leaps of the past two years, such as mathematical reasoning jumping from roughly high-school level to near that of a PhD student, are improvements most non-specialists find hard to perceive concretely. Only the small number of people deeply involved in the relevant fields can appreciate the technical distance behind that qualitative change. For most users, AI progress since 2023 shows up in subtler ways, such as somewhat fewer hallucinations or somewhat more natural phrasing, while the substantive advances in reasoning depth, cross-modal integration, and complex task decomposition lack clear concepts and yardsticks. It is therefore hard for the public to recapture the awe of ChatGPT's first appearance.
Notably, neither reason has anything to do with the actual magnitude of capability improvement; both are rooted entirely in human psychological perception. That brings us back to the original question and suggests a very different possibility: the so-called plateau may not be an objective fact about foundation-model progress but largely a psychological illusion. The models' advance has probably not stalled; rather, our yardstick for progress and our capacity for surprise were dulled by the initial shock. The true scale of the progress only becomes visible when we deliberately break this perceptual inertia, for instance by placing the latest generation, such as GPT-5, side by side with its "ancestors" from two or three years ago.
Following this line of reasoning, we arrive at a judgment: foundation models evolving along the current path, however many benchmarks they conquer in however many subfields, are unlikely to trigger another phenomenon-level societal shock. Society has grown accustomed to headlines of the form "Model X achieves a new high on Benchmark Y." Even if, in 2026, a single model scored perfect marks simultaneously in the international olympiads for mathematics, physics, chemistry, and informatics, the public reaction would likely amount to brief surprise, nothing like the global wave of discussion that followed ChatGPT's release at the end of 2022.
The next breakthrough capable of sparking a comparable sensation will most likely not come from score gains in today's paradigm of single, Q&A-centered foundation models. It will come from building on top of powerful foundation models, using them as construction materials and combining them with carefully designed multi-agent collaboration architectures that organize dispersed capabilities, fully unleash their potential, and yield "super-intelligent systems" capable of ultra-long workflows, autonomous decision-making, and multi-stage task execution.

Such systems would no longer merely answer questions. They would plan independently, call tools across domains, adjust strategy dynamically, and steadily advance toward a goal over continuous runs lasting hours, days, or even weeks. Conversely, once such expert-level systems take shape across multiple fields, their organic integration could in turn drive a further leap in the capabilities of foundation models themselves, much as scholars in different disciplines distill their latest findings into textbooks, giving the next generation an across-the-board cognitive upgrade.
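As a rough illustration of what such orchestration might look like, here is a deliberately simplified Python sketch. Every name in it (`Orchestrator`, `Agent`, `act`, and so on) is hypothetical and mimics no real agent framework; the foundation-model call is replaced by a stub string.

```python
# Hypothetical sketch of a multi-agent loop over a shared foundation model.
# All class and method names here are illustrative, not a real framework's API.
from dataclasses import dataclass, field

@dataclass
class Agent:
    role: str  # the specialist persona this agent plays

    def act(self, task: str) -> str:
        # Stand-in for a foundation-model call plus tool use.
        return f"[{self.role}] handled: {task}"

@dataclass
class Orchestrator:
    agents: list                                # specialists built on one base model
    log: list = field(default_factory=list)     # persistent memory across stages

    def run(self, goal: str, steps: list) -> list:
        # Decompose the goal into stages, route each stage to a specialist,
        # and keep a running log so later stages can build on earlier output.
        for i, step in enumerate(steps):
            agent = self.agents[i % len(self.agents)]
            self.log.append(agent.act(step))
        return self.log

team = Orchestrator(agents=[Agent("planner"), Agent("coder"), Agent("reviewer")])
result = team.run("ship feature", ["draft plan", "write code", "review diff"])
```

The point of the sketch is structural: the intelligence lives in the base model behind each `act` call, while the orchestration layer contributes the long-horizon decomposition, routing, and shared memory that a single Q&A model lacks.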
In other words, although there is still headroom on technical metrics, the foundation-model paradigm itself is gradually becoming a crowded "red ocean" of competition. The real "blue ocean" lies in reconstructing organizational forms and interaction mechanisms on top of that paradigm, turning AI from a passive answerer into an active agent. This also answers the AI skeptics directly: the AI boom has not entered a lull; it is undergoing a transition from the single-model paradigm to a systemized form of intelligence.