The 2022 “Virtual Digital Human Industry Chain Map” (虚拟数字人产业链图谱) is a single industry-structure diagram that lays out the sector as an upstream–midstream–downstream chain; it is commonly reproduced in 2022 articles and reports with the attribution “Source: Toubao (LeadLeo) Research Institute and West China Securities Research Institute.” Upstream covers the foundational capabilities and inputs that make digital humans feasible: computing and chips, cloud and networking, sensors and capture hardware, and core software/tooling for 3D modeling, animation, rendering, speech/voice, NLP, and related AI components. Midstream covers the “production and platform” layer that turns those inputs into usable digital-human products, typically including digital-human creation engines/pipelines, content production and operation services, and platform providers that offer creation, management, and deployment. Downstream covers application deployment and commercialization across multiple scenarios, usually grouped into enterprise and public-facing uses such as marketing and brand spokescharacters, livestream/e-commerce hosting, customer service and reception, media and entertainment/virtual idols, education and training, finance and retail services, and government/cultural-tourism public services. The practical point of the map is not to rank companies but to show where value accumulates across the stack, how toolchains connect to content/operations platforms, and how the same underlying capabilities can be repackaged for different verticals; it is best treated as a taxonomy of roles in the ecosystem rather than a definitive inventory of every vendor.
This strip is the upstream “content production and tooling” layer of the virtual digital human industry chain, and it is essentially describing where a digital human is authored, technically constructed, and packaged as an IP asset before it is deployed into downstream applications. The left vertical label “上游” (upstream) indicates it is positioned as a prerequisite layer for everything that follows: without production studios, DCC (digital content creation) software, real-time engines, and motion-capture inputs, the midstream platform and downstream scenario layers have nothing reliable to operate or distribute.
The first sub-block, “内容制作类” (content production), points to companies whose core value is building the digital human itself as an audiovisual and interactive asset: character design, look development, modeling, grooming, shading, animation, facial performance, and final production for film, advertising, games, or live events. The logos shown here include Digital Domain and Microsoft, alongside Chinese production players (for example, 原力, the Original Force animation studio, a prominent name in China’s CG/animation services ecosystem). In industry-chain terms, this category is less about “software tools” and more about labor, pipelines, craft quality, and proven delivery at scale. The reason it sits upstream is that it often supplies the first “hero” digital humans (and the production know-how) that later get repurposed into productized, reusable, or platform-managed digital humans.
The middle block, “工具类” (tools), splits into two functional toolchains that correspond to the core technical steps of building and running a digital human. The “建模及绑定” (modeling and rigging/binding) segment is the character-construction stack: sculpting and topology, UVs, materials, blendshapes, skeleton/skin weighting, and the rig controls that allow later performance driving. The appearance of Autodesk alongside other well-known DCC brands in the image (Houdini, ZBrush, and Autodesk’s own Maya) indicates an expectation that mature, production-grade DCC ecosystems remain foundational even when the end deployment is “real-time.” The “渲染” (rendering) segment includes real-time engines (logos corresponding to Unreal Engine and Unity are visible), which signals a second trend: digital humans are increasingly built for interactive or semi-live use, where real-time rendering and runtime animation systems become part of the “authoring” workflow rather than a separate downstream concern. In practice, this is where pipelines shift from offline VFX conventions to hybrid pipelines that can deliver both film-quality assets and real-time-optimized variants.
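The two rigging operations named in this segment, blendshapes and skin weighting, reduce to simple weighted sums, which is worth seeing concretely. The sketch below is a toy illustration only: the vertex data is invented, and the bones are translation-only for brevity, whereas production rigs use full rotation/translation transforms (or dual quaternions). None of these names or numbers come from the chain map itself.

```python
# Toy illustration of two core rigging operations: additive blendshapes and
# (simplified) linear blend skinning. All data is invented for this example.

def apply_blendshapes(base, deltas, weights):
    """Blendshapes are additive offsets: vertex = base + sum(w_i * delta_i)."""
    out = [list(v) for v in base]
    for delta, w in zip(deltas, weights):
        for v, d in zip(out, delta):
            for k in range(3):
                v[k] += w * d[k]
    return out

def linear_blend_skinning(verts, bone_offsets, skin_weights):
    """Simplified LBS with translation-only bones:
    deformed vertex = sum_j skin_weight[j] * (vertex + bone_offset[j])."""
    out = []
    for v, wrow in zip(verts, skin_weights):
        p = [0.0, 0.0, 0.0]
        for off, w in zip(bone_offsets, wrow):
            for k in range(3):
                p[k] += w * (v[k] + off[k])
        out.append(p)
    return out

base  = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]  # rest pose
smile = [(0.0, 0.1, 0.0), (0.0, 0.2, 0.0), (0.0, 0.1, 0.0)]  # offsets from base

posed = apply_blendshapes(base, [smile], [0.5])   # drive the shape at 50%

bones   = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]      # bone 1 lifts by one unit
weights = [[1.0, 0.0],                             # vertex 0: all bone 0
           [0.5, 0.5],                             # vertex 1: split evenly
           [0.0, 1.0]]                             # vertex 2: all bone 1
skinned = linear_blend_skinning(posed, bones, weights)
```

The split-weight vertex lands halfway between the two bones' influence, which is the smooth-falloff behavior skin weighting exists to provide; this is also why rig quality (weight painting, blendshape libraries) is treated upstream as a reusable, value-holding asset.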
The “动作捕捉设备(惯性/光学)” (motion capture equipment, inertial/optical) segment is singled out because performance capture is one of the main cost, realism, and scalability bottlenecks for digital humans. Optical systems (typically marker-based camera volumes) can deliver high-precision body and sometimes face capture in controlled spaces, while inertial systems (IMU-based suits and wearable sensors) trade some accuracy for portability and lower setup cost. The brands shown cover both international and domestic suppliers, including Vicon, Noitom, Xsens, Apple, and Intel, plus a domestic optical/capture brand labeled “青瞳视觉 CHINGMU.” The inclusion of Apple here is best read as “device capture and sensing as an input channel” (phones/tablets with depth sensors, high-quality cameras, and compute), rather than as a dedicated mocap-vendor role. From an industry-chain perspective, performance capture suppliers sit upstream because they determine how cheaply and consistently animation data can be produced, which directly influences whether a digital human remains a bespoke “project” or becomes a continuously updated “product.”
The “IP策划类” (IP planning) block emphasizes that upstream value is not only technical. A digital human that is meant to function as an enduring virtual persona needs character/IP planning: backstory, personality constraints, visual identity rules, content calendar, brand-fit strategy, and governance over how the persona evolves. The logos shown include China Literature (阅文集团) and Brud, pointing to an interpretation of digital humans as managed media IP rather than purely as “AI software.” In many commercialization paths, especially marketing and entertainment, IP planning is the upstream step that determines whether the digital human is treated as a one-off campaign asset or as a multi-year franchise.
The annotation on the right is making two substantive claims about what this upstream layer covers and where competitive advantage sits. First, it frames “virtual digital humans” as an extension of earlier-stage “digital human” work: early phases focus on persona/appearance definition and content production/planning, and later phases emphasize the technical stack of modeling/rigging, driving (animation control, performance input, or AI behavior control), and rendering. Second, it asserts a current division of capabilities: many domestic vendors can handle “形象、语音、语言” (visual likeness/appearance, voice, and language), but some foreign vendors still hold advantages in certain core technologies; the implication is that domestic players are expected to expand outward from their own core technologies over time. Even without accepting that claim wholesale, the practical takeaway for analysis is clear: this upstream layer is where toolchain lock-in, pipeline standards, and performance-capture economics tend to decide who can scale output, who can maintain quality, and who can reduce per-minute/per-asset costs enough to support high-frequency deployment downstream.
If you are using this layer to interpret the full chain map, the most important analytical point is that it mixes three distinct “inputs” that often get conflated: creative production capacity (studios), technical enablement (software/engines), and data acquisition (mocap/sensing), plus a fourth that is frequently underweighted in technical discussions: IP planning. The closer a participant is to controlling repeatable pipelines (standard rigs, reusable asset libraries, scalable capture, and real-time deployment), the more likely they are to shift from project revenue to platform-like revenue; conversely, participants that remain locked into bespoke delivery tend to stay upstream service providers even if their work is essential.
This strip represents the midstream consolidation layer of the virtual digital human industry chain. It shows where upstream inputs such as content production, creation tools, rendering engines, and motion-capture data are integrated into deployable virtual digital human products and services. The strip groups companies by the primary advantage they contribute to integration and commercialization.
The first group, labeled “vertical virtual human vendors” (垂直虚拟人厂商), refers to companies whose core business is virtual digital humans rather than a broader AI or internet platform. These firms typically provide end-to-end delivery, combining character creation, driving and animation control, deployment, and ongoing operations into a packaged offering. Their competitive advantage is usually speed of customization, reliability of delivery, and operational support across common scenarios, rather than owning a single foundational technology that dominates the whole stack.
The second group, labeled “specialist AI vendors” (专长类AI厂商), refers to companies that contribute a specific AI capability that virtual humans depend on. In most implementations, these capabilities concentrate in speech and language, including speech recognition, speech synthesis, conversational NLP, and related tooling. The implication of placing them midstream is that these vendors are commonly integrated as modular components inside a broader virtual human pipeline, where they can be licensed, swapped, tuned, or combined with other suppliers depending on the target industry and compliance constraints.
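The claim that specialist vendors plug in as swappable modules can be made concrete with a small interface sketch. This is a hypothetical design in Python, not any real vendor's API: the pipeline depends only on a speech-synthesis contract, so one supplier can be replaced by another (say, for compliance or cost reasons) without touching the rest of the stack.

```python
from typing import Protocol

class SpeechSynthesizer(Protocol):
    """Minimal contract a TTS component must satisfy to be swappable."""
    def synthesize(self, text: str) -> bytes: ...

class VendorATTS:
    """Stand-in for one specialist vendor's TTS module (hypothetical)."""
    def synthesize(self, text: str) -> bytes:
        return f"[vendor-a-audio:{text}]".encode()

class VendorBTTS:
    """Stand-in for a second vendor, e.g. preferred for compliance (hypothetical)."""
    def synthesize(self, text: str) -> bytes:
        return f"[vendor-b-audio:{text}]".encode()

class VirtualHumanPipeline:
    # The pipeline depends only on the SpeechSynthesizer interface, so
    # specialist components can be licensed, tuned, or swapped per deployment.
    def __init__(self, tts: SpeechSynthesizer):
        self.tts = tts

    def speak(self, text: str) -> bytes:
        return self.tts.synthesize(text)

audio_a = VirtualHumanPipeline(VendorATTS()).speak("hello")

pipeline = VirtualHumanPipeline(VendorATTS())
pipeline.tts = VendorBTTS()          # swap the speech module at runtime
audio_b = pipeline.speak("hello")
```

The same pattern extends to speech recognition and conversational NLP modules, which is why the map positions these vendors as components inside a midstream integrator's offering rather than as end-to-end providers themselves.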
The third group, labeled “comprehensive / internet technology vendors” (综合类/互联网技术厂商), represents large platform companies that can industrialize deployment. Their advantage comes from owning cloud infrastructure, distribution channels, developer ecosystems, and high-traffic product surfaces, which allows them to package virtual humans as scalable products and enterprise solutions. In this framing, they reduce adoption friction by bundling virtual humans with hosting, tooling, deployment frameworks, compliance processes, and customer acquisition pathways.
The fourth group, labeled “XR/CG vendors” (XR/CG厂商), represents companies whose strengths are in real-time graphics, capture, and production pipelines. They tend to differentiate through visual fidelity, embodiment quality, performance capture, and the integration of virtual humans into interactive experiences such as apps, livestream environments, exhibitions, and branded digital experiences. This group sits midstream because these capabilities often determine the practical “last mile” quality and feasibility of a virtual human deployment.
The note at the right argues that large internet technology companies are structurally positioned to connect and scale the virtual digital human sector. The stated reasons are accumulated technical capability, talent, data, and customer trust built over many years, which can be leveraged to expand into virtual humans. Interpreted analytically, this is a claim about go-to-market power and operational leverage, and it explains why this layer is presented as the main aggregation point where specialist components are assembled into solutions that can be sold, deployed, maintained, and iterated.
This strip represents the downstream application layer (下游) of the virtual digital human industry chain. Its purpose is to show where virtual digital humans are actually deployed, what types of organizations adopt them, and how demand is segmented by industry. Unlike the upstream and midstream layers, which focus on how virtual humans are produced and integrated, this layer is framed around use cases, customer categories, and commercialization pathways.
The panel on the left summarizes the map’s claim about market contribution and scale in 2020. It states that downstream application markets had already reached a “hundreds of billions” RMB level, with finance and pan-entertainment contributing the main shares, and that continued technical progress, expanding demand, and supportive policy would drive further growth. The small bar chart labeled “2020年下游领域市场规模(亿元)” (2020 downstream market size by segment, in hundreds of millions of RMB) breaks this into application types and assigns approximate sizes to several categories, including financial sector technology investment (the largest bar), RPG-type games, virtual livestreaming, virtual idols/images, and special-effects film. Even if the exact figures vary by source methodology, the structure signals how the authors think value is realized: large institutional spend (finance) and scalable consumer content (games and livestreaming) dominate, while certain categories such as film VFX are shown as comparatively smaller in this specific framing.
The largest center block is “pan-entertainment” (泛娱乐领域), and it is split into three sub-scenarios that reflect different commercialization logics. “Brand spokespersons” (品牌代言) emphasizes virtual humans as controlled marketing assets used for advertising, product launches, and long-term brand association, illustrated by brands such as L'Oréal, KFC, Xiaomi, and Alibaba Group. “Media” (传媒) emphasizes virtual humans as presenters, hosts, anchors, or content personalities that can appear in news, variety, and platform content ecosystems, illustrated by organizations and platforms such as CCTV, People's Daily, Xinhua News Agency, iQIYI, Bilibili, Tencent, Douyin, and Huya. “Games” (游戏) emphasizes virtual humans as characters, companions, NPC-like performers, or promotional figures within game ecosystems, where the value driver is repeatable content and in-game engagement rather than one-off campaign output.
The “culture and tourism” block (文旅领域) points to public-facing cultural institutions and tourism organizations using virtual humans as interpreters, guides, or educational presenters. In this scenario, the virtual human is positioned as a service interface for cultural heritage communication and visitor engagement, often emphasizing standard narration, multilingual delivery, and consistent availability. The logos and seals in this block indicate government or institutional adopters, which implies a procurement pattern closer to public service digitization than to consumer entertainment.
The “finance” block (金融领域) shows banks and major financial-service institutions as adopters, implying a focus on customer-facing service, digital reception, product explanation, and potentially compliance-controlled communication. The presence of institutions such as SPD Bank, Ping An, China UnionPay, Agricultural Bank of China, and China Everbright Bank signals that the map treats finance as a high-spend, high-volume domain where virtual humans can be justified through scale, standardization, and measurable service metrics, rather than purely through novelty.
The “medical” (医疗领域) and “education” (教育领域) blocks depict institutional deployments where credibility, clarity, and repeatability of information matter. In healthcare, the implied role is a virtual assistant or explainer for intake guidance, hospital navigation, patient education, or public health communication, typically under strict constraints on what can be claimed. In education, the implied role is instructional support, campus or platform hosting, or standardized learning content delivery, including state-linked learning platforms and media education brands. Across both domains, the map is presenting virtual humans less as entertainment characters and more as human-like interfaces for information delivery, where the primary differentiators become accuracy, governance, and operational integration rather than visual spectacle.