Data, Models and Simulations

数据、模型与仿真 Data, Models and Simulations

作为一个科研人，我的日常工作就是与各种各样的数据、模型与仿真打交道。尤其是在交通这样一个涉及领域比较交叉、所研究对象又非常复杂的领域，所研究方向偏向数据、建模与仿真的学者在领域内都有各自的影响力。由于学者之间"文人相轻"的光荣传统，不同研究偏向的人互相之间常常针锋相对。做建模的学者常常自负于思考的深度与数学技术上的娴熟度，对另外两者往往发出"就这点技术含量也能算研究吗？"的质问；做仿真的学者多强调自己研究的实际价值，对其余方向有时会有"自娱自乐"之批判；做数据的人由于乘上了人工智能的东风，往往在影响力上远超其余两者，因此难免会偶有飘飘然之感。近年来，大模型的成功深刻地改变了学术界的审美，对数据的偏好急剧增加，很多人甚至已经认定了未来的研究必然是数据为王，以前的研究范式将会被彻底推翻。本文试图从"第一性原理"出发，用一个统一的逻辑框架来理解这三类研究各自的价值，从而以此窥探未来的学术图景将会以怎样的形式演变。

As a researcher, I spend my days working with all kinds of data, models, and simulations. This is especially true in transportation, a highly interdisciplinary field whose objects of study are very complex, and where scholars who lean toward data, modeling, or simulation each have their own influence. Thanks to the "honorable tradition" of scholars looking down on one another, people with different research orientations are often at odds. Modeling-focused scholars tend to pride themselves on the depth of their thinking and their mathematical skill, and often ask of the other two, "Does something with so little technical content even count as research?" Simulation-focused scholars emphasize the practical value of their work and sometimes criticize the other orientations as "self-indulgent." Those who work with data have ridden the wave of artificial intelligence and often far exceed the other two in influence, so it is hardly surprising that they occasionally get carried away. In recent years, the success of large models has profoundly reshaped academic taste, and the preference for data has grown sharply. Many have already concluded that future research will inevitably be ruled by data and that earlier paradigms will be completely overturned. This article starts from "first principles" and uses a unified logical framework to understand the value of each of the three types of research, and from there to glimpse how the academic landscape might evolve.

首先需要明确的是，在数据、模型与仿真这三个研究范式中，数据与其它两者可以区分开来，被分为更上层的两类。数据类的研究，从底层逻辑而言，是一种"归纳"，即从或少量或海量的个例中总结出或简单或复杂的规律，从而加强人们对具体问题的理解，以便能够对事物做出更加精准的预测。即使是近年来如火如荼的大语言模型也跳不出上述的逻辑框架，其成功依赖于极大的语料数据量与非常复杂的模型结构（往往用参数数量来衡量），从而能够成功进行"下一词预测"。而模型和仿真类研究，从底层逻辑上来讲则是"演绎"，即：根据一组给定的规则集合（在该类研究中常以假设的形式出现），一步步推演出所研究问题的结果。而模型和仿真之间的区别则更多是"量"的区别而非"质"的区别：模型通常被认为是形式较为简化的、甚至能够通过数学推理得到确定性的结论的演绎形式；而仿真则是更加复杂的演绎形式，其所基于的规则集合较为庞大，往往需要借助计算机的力量来完成从输入到输出的转换。因此，数据与建模/仿真实际上代表了人类对科学问题的认知过程本身：通过数据来归纳出规则，而通过将这些规则代入建模/仿真，从而演绎得到设想场景的结果。明确了以上的基本原则后，我们对于数据、模型与仿真各自所起到的作用实际上就相对明确了。打一个不太严谨的比方，数据类的研究更接近于"科学"的意涵，重在总结与发掘已有事物的规律；而建模/仿真则更接近于"工程"的意涵，重在创造现实中还不存在的事物。

First, it is important to clarify that among the three research paradigms of data, modeling, and simulation, data can be separated from the other two, giving a higher-level split into two classes. At its core, data-oriented research is a form of "induction": summarizing patterns, simple or complex, from a handful or a vast number of instances, so as to deepen our understanding of specific problems and make more precise predictions. Even the large language models that have drawn so much attention in recent years stay within this framework. Their success relies on enormous amounts of corpus data and very complex model structures (often measured by the number of parameters), which together make successful "next-word prediction" possible. Modeling and simulation research, by contrast, is at its core "deduction": starting from a given set of rules (which in this kind of research usually appear as assumptions), one derives the outcome of the problem under study step by step. The difference between modeling and simulation is more one of degree than of kind. A model is usually understood as a relatively simplified form of deduction, one that may even yield definite conclusions through mathematical reasoning. A simulation is a more complex form of deduction, built on a larger set of rules, and often needs a computer to carry out the transformation from input to output. Data and modeling/simulation therefore represent the human process of understanding scientific problems itself: rules are induced from data, and those rules are then fed into models or simulations to deduce the outcome of a hypothetical scenario. Once this basic principle is clear, the respective roles of data, modeling, and simulation become relatively clear as well. To use a somewhat loose analogy, data-oriented research is closer in spirit to "science," which focuses on summarizing and uncovering the laws of things that already exist, while modeling and simulation are closer in spirit to "engineering," which focuses on creating things that do not yet exist in reality.
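
The split between induction and deduction described above can be made concrete with a small sketch. The example is illustrative only and is not taken from the article: it assumes a linear speed-density relation, and the observations, function names, and the query density of 45 veh/km are all hypothetical. The first step induces a rule from observed data by least squares; the second deduces the flow for a density that was never observed.

```python
# Illustrative sketch: induction (fit a rule from data), then deduction
# (apply the rule to an unobserved scenario). All numbers are made up.

def induce_linear_rule(densities, speeds):
    """Induction: ordinary least-squares fit of speed = intercept + slope * density."""
    n = len(densities)
    mean_k = sum(densities) / n
    mean_v = sum(speeds) / n
    cov = sum((k - mean_k) * (v - mean_v) for k, v in zip(densities, speeds))
    var = sum((k - mean_k) ** 2 for k in densities)
    slope = cov / var
    intercept = mean_v - slope * mean_k
    return intercept, slope


def deduce_flow(rule, density):
    """Deduction: flow = density * speed, with speed given by the induced rule."""
    intercept, slope = rule
    speed = max(0.0, intercept + slope * density)  # speed cannot be negative
    return density * speed


# Hypothetical observations: density (veh/km) and mean speed (km/h).
observed_density = [10.0, 20.0, 40.0, 60.0, 80.0]
observed_speed = [90.0, 80.0, 60.0, 40.0, 20.0]

rule = induce_linear_rule(observed_density, observed_speed)
flow_at_45 = deduce_flow(rule, 45.0)  # a density absent from the data
print(rule, flow_at_45)
```

The quality of the deduced flow depends entirely on the induced rule, which is the sense in which the two steps form a single cognitive loop.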

然而，这样的总结跟科研中的一些经验似乎不甚符合，因为很多经典的建模/仿真类研究实际上就是用于研究已有事物的。例如在交通领域，LWR交通流模型与交通网络均衡模型就是分别用于刻画微观、中观交通流的数学模型，而这些都是已经存在的事物，并非从未出现过的事物。大名鼎鼎的广义相对论也是爱因斯坦通过精妙的演绎得到的。怎么解释这些与上述推论的不符合之处呢？这事实上是由于计算机发展的历史所造成的结果。个人计算机的普及只有短短30年左右的历史，并且直到最近十多年个人计算机才开始拥有较强的计算能力。这就意味着，在本世纪之前，从小数据甚至零数据的状态去演绎对事物的认知模型几乎是一种必然选择，而且这样的选择在物理领域的巨大成功（相对论、量子力学、量子场论）也激励了其他学科选择相似的技术路线。但是，这样的路线在涉及到复杂系统、尤其是与人的行为相关的复杂系统的时候，很难复制在物理领域的成功。进入新世纪后，计算机硬件、信息科学、机器学习的迅猛发展使得海量数据的获取成为可能，同时也诞生了神经网路这样的拥有巨大参数量的复杂模型，大数据与复杂模型的结合使得人类对于复杂系统的建模能力得到了质的飞跃，逐步攻克了计算机视觉、自然语言处理等极其复杂的任务。如同半个世纪前一样，这样的成功蔓延到了所有的相关学科，使得这些学科都轰轰烈烈地展开了通过数据驱动的方法代替领域内的经典模型的大革命。这也就是最近这一轮数据类研究的热潮的本质。

However, this summary does not quite match experience in research, because many classical modeling and simulation studies are in fact used to study things that already exist. In transportation, for example, the LWR traffic flow model and the traffic network equilibrium model are mathematical models that describe, respectively, macroscopic traffic flow along a road and the distribution of flow over a network, both of which already exist rather than being something never seen before. Einstein's renowned general theory of relativity was likewise obtained through brilliant deduction. How can these cases be reconciled with the conclusion above? They are in fact a consequence of the history of computing. Personal computers have been widespread for only about 30 years, and only in the last decade or so have they acquired substantial computing power. This means that before this century, deducing models of phenomena from little or even no data was almost the only option, and the tremendous success of this approach in physics (relativity, quantum mechanics, quantum field theory) encouraged other disciplines to follow a similar route. However, that success is hard to replicate for complex systems, especially those involving human behavior. In the new century, rapid progress in computer hardware, information science, and machine learning has made it possible to collect massive amounts of data and has produced complex models with huge numbers of parameters, such as neural networks. The combination of big data and complex models has brought a qualitative leap in our ability to model complex systems, gradually conquering extremely difficult tasks such as computer vision and natural language processing. Just as happened half a century ago, this success has spread to all related disciplines, each of which has launched a sweeping campaign to replace its classical models with data-driven methods. This is the essence of the recent surge in data-oriented research.

可以预见的是，上述的"数据"类研究的风潮还会在各个领域继续火一段时间，直到数据获取的能力达到一个阈值为止。对于一些较难获取大量数据的学科而言，领域的一些经典模型将会一定程度和数据驱动的范式进行融合，形成类似于"物理信息模型（physics-informed models）"这样的结合体。在那之后，各个领域的重点一定会聚焦到"工程"问题上，即：如何去推演还未出现过的事物？以及，如何去对未知的未来进行决策？从底层逻辑而言，建模/仿真类的研究对这一类问题是必不可少的，因为我们必然需要对从未出现过的事物进行演绎。因而，我个人的判断是，这一轮对数据的研究热潮过去后，演绎类的研究热度会上升。与过去不一样的是，将来的演绎类研究将会基于有海量数据支撑的更加复杂、更加准确的规则集合，演绎的推动将会更加依赖计算机和人工智能的辅助，而不是像过去那样的过于简化的演绎形式。

It is foreseeable that this wave of data-oriented research will remain popular across fields for some time, until the ability to acquire data reaches a certain threshold. In disciplines where large amounts of data are hard to obtain, some classical models will merge to a degree with the data-driven paradigm, forming hybrids such as "physics-informed models." After that, the focus of every field is bound to shift to "engineering" problems: how do we deduce things that have not yet appeared, and how do we make decisions about an unknown future? At a fundamental level, modeling and simulation research is indispensable for such problems, because reasoning about things that have never appeared requires deduction. My personal judgment is therefore that once the current enthusiasm for data subsides, interest in deductive research will rise. Unlike in the past, future deductive research will be based on more complex and more accurate rule sets backed by massive data, and carrying out the deduction will rely more heavily on computers and artificial intelligence rather than on the overly simplified deductive forms of the past.
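
One common way to formulate such a hybrid is a loss with a data term and a physics term. The expression below is a generic sketch, not the article's own formulation. It assumes a learned density field $\rho_\theta$, $N$ observations $(x_i, t_i, \rho_i)$, $M$ collocation points $(x_j, t_j)$, a weight $\lambda$, and, as the physical rule, the LWR conservation law with an assumed flow-density relation $Q$.

```latex
\mathcal{L}(\theta) =
\frac{1}{N}\sum_{i=1}^{N}\bigl(\rho_\theta(x_i,t_i)-\rho_i\bigr)^{2}
+ \lambda\,\frac{1}{M}\sum_{j=1}^{M}
\left.\left(\frac{\partial \rho_\theta}{\partial t}
+ \frac{\partial Q(\rho_\theta)}{\partial x}\right)^{2}\right|_{(x_j,t_j)}
```

The first term is inductive, since it fits the observations. The second penalizes violations of a rule known in advance, which is why hybrids of this kind are attractive when data are scarce.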

讲到这里，我们便可以引出建模与仿真之间的优劣分析。在现在这个时间节点，随着研究对象越来越复杂，很多建模类的研究为了得到一些解析的结论，不得不引入非常简化的规则集合，从而牺牲了对现实刻画的准确度。随着计算性能的继续发展，似乎未来的演绎研究可以完全建立在更加复杂而真实的仿真之上，使得简化的数学模型变得缺乏必要性。然而，计算性能的发展伴随着计算需求的快速增长，而算力在这个过程中大概率会一直处于供不应求的状态，因此在演绎过程中节约计算资源必将成为一个重要课题。对于很多演绎任务而言，高精度的仿真并不是必须的，对仿真这个"复杂映射"进行保精度的压缩会显著节约计算成本。从这个角度上来讲，"模型"与"仿真"之间的界限在未来会逐渐模糊，演变为从精简的"小模型"到庞大的"大模型"之间的一个连续统，这个连续统之上的任何一个点都会有其最适合的演绎、决策任务。建模的过程也会变得高度自动化，因为我们有足够的技术来对极度消耗计算资源的大仿真进行"蒸馏"。而那种通过人工进行建模、再通过繁杂的数学推理得到一些解析结论的"经典建模类研究"或许会逐渐式微，成为一种古早艺术一般的娱乐活动；因为随着人工智能的发展，人类未来或许并不再需要通过这样的方式来"理解"复杂现象。事实上，最近十年人工智能领域的重大突破往往都是基于工程实践中的尝试而非严谨的理论推演，已经能够在一定程度上佐证这样的趋势。

This brings us to the relative strengths and weaknesses of modeling and simulation. At present, as research subjects grow more complex, many modeling studies have to adopt highly simplified rule sets in order to reach analytical conclusions, sacrificing accuracy in describing reality. As computing performance continues to improve, it may seem that future deductive research could be built entirely on more complex and realistic simulations, making simplified mathematical models unnecessary. However, growth in computing performance is accompanied by rapid growth in computing demand, and computing power will most likely remain in short supply throughout. Saving computational resources during deduction is therefore bound to become an important topic. Many deductive tasks do not require high-precision simulation, and compressing the simulation, viewed as a "complex mapping," while preserving accuracy can cut computational cost significantly. From this perspective, the boundary between "model" and "simulation" will gradually blur, giving way to a continuum from compact "small models" to massive "large models," where every point on the continuum has deductive and decision-making tasks it suits best. Model building will also become highly automated, because we will have the techniques to "distill" large, computationally expensive simulations. "Classical modeling research," in which a model is built by hand and analytical conclusions are then obtained through intricate mathematical reasoning, may gradually decline into a pastime resembling an antique art form, because as artificial intelligence advances, humans may no longer need to "understand" complex phenomena in this way. Indeed, the major breakthroughs in artificial intelligence over the past decade have often come from trial and error in engineering practice rather than from rigorous theoretical derivation, which already lends some support to this trend.
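
The idea of compressing a simulation into a cheaper mapping can be sketched as follows. Everything here is a hypothetical illustration rather than a method from the article: the "expensive" simulator is a toy point queue with a ramping demand, and the surrogate is just a lookup table with linear interpolation, standing in for whatever learned model would be used in practice.

```python
# Illustrative sketch: "distilling" a costly simulation into a cheap surrogate.
# The simulator, the grid, and the query value are all hypothetical.
from bisect import bisect_right


def simulate_final_queue(arrival_rate, capacity=1.0, steps=10_000, dt=0.01):
    """Stand-in for an expensive simulation: step a point queue forward in time.

    Demand ramps linearly from 0 up to arrival_rate; the queue grows whenever
    demand exceeds capacity. Returns the queue length at the end of the horizon.
    """
    queue = 0.0
    for i in range(steps):
        demand = arrival_rate * (i + 1) / steps
        queue = max(0.0, queue + (demand - capacity) * dt)
    return queue


def distill(simulator, grid):
    """Run the simulator once per grid point and keep only the input-output table."""
    return [(x, simulator(x)) for x in grid]


def surrogate(table, x):
    """Cheap replacement: piecewise-linear interpolation in the distilled table."""
    xs = [point[0] for point in table]
    j = min(max(bisect_right(xs, x), 1), len(xs) - 1)
    (x0, y0), (x1, y1) = table[j - 1], table[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


grid = [0.5 + 0.1 * i for i in range(26)]  # arrival rates from 0.5 to 3.0
table = distill(simulate_final_queue, grid)

query = 1.73  # an arrival rate that is not on the grid
approx = surrogate(table, query)       # one table lookup
exact = simulate_final_queue(query)    # a full simulation run
print(f"surrogate={approx:.4f} simulation={exact:.4f}")
```

A coarser grid gives a smaller and cheaper surrogate at the cost of accuracy, and a finer one does the opposite, which is one way to picture the continuum between "small models" and "large models."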

基于上述的讨论，作为归纳的数据类研究与作为演绎的建模/仿真类研究，在近期的未来依然会发挥其各自的作用；归纳类研究完全取代演绎类研究这样的可能性从"第一性原理"而言应当是不存在的。而在较为远期的未来，具有自主决策能力的超级人工智能的出现可能会实现科学研究领域的完全自动化推进，再天才的人类大脑在这样的人造神面前都会如同原始生物一般无力；在那个时候，我们或许也就不再需要去为这样的问题而烦恼了，科学研究作为一个人类曾经的伟大职业大概也将会寿终正寝。

Based on the discussion above, data-oriented research as induction and modeling/simulation research as deduction will each continue to play their roles in the near future. From a "first principles" standpoint, inductive research should not be able to replace deductive research completely. In the more distant future, the emergence of superintelligent AI capable of autonomous decision-making may fully automate the advancement of scientific research, and even the most brilliant human mind would be as powerless as a primitive organism before such a man-made god. By then we may no longer need to worry about these questions, and scientific research, once a great human profession, will probably come to its natural end.
