T/BIA 30-2025 defines a technical framework and requirements set for "large model enhanced" digital human systems, positioning large models as the enabling layer across the full pipeline: character and scene creation (2D/3D generation, textures, scene composition, HDRi/skybox imagery), animation and control (lip-sync, gesture generation, instruction-driven actions, and expression/behavior tags mapped to outputs), and user-facing interaction (text dialogue, voice, visual, and multimodal interaction). It specifies a modular architecture that typically combines modality-specific and multimodal large models with supporting components: a domain knowledge base, an API-exposed tool library, memory storage, an agent for task planning and execution, and input/output content safety screening. The standard then translates that architecture into measurable capability expectations and performance targets for quality, latency, accuracy, naturalness, and robustness at each stage, aiming to standardize how such digital human systems are built, evaluated, and accepted for practical deployment.
Front Matter
Standard: T/BIA 30-2025
Title: 基于大模型的数字人系统技术要求 (Technical requirements of large model enhanced digital human system)
Release and implementation date: 2025-09-01
Foreword (drafting basis, patent notice, proposing and administering organizations, drafting units and drafters)
1 Scope (范围)
Defines technical requirements and evaluation methods for large-model-based digital human systems across modeling, driving, and interaction
Applies to the R&D, testing, evaluation, and acceptance of large-model-based digital human systems
2 Normative References (规范性引用文件)
References YD/T 4393.1-2023 Virtual Digital Human Metrics Requirements and Evaluation Methods, Part 1: Reference Framework
3 Terms and Definitions (术语和定义)
Digital human / virtual digital human (虚拟数字人)
Digital human system (数字人系统)
Large model (大模型)
4 Abbreviations (缩略语)
2D, 3D, NeRF, AI, API, MOS
5 Overall View: How Digital Human Systems Use Large Models (数字人系统应用大模型总体视图)
Lifecycle stages referenced: modeling, rendering, driving, interaction
Large-model application support system modules
Large model(s) by modality (language, vision, speech, video, 3D, multimodal)
Domain knowledge base
Domain tool library (via APIs)
Memory storage repository
Agent (task decomposition, planning, execution, continuous improvement)
Input content safety component (request risk screening / blocking)
Output content safety component (response risk screening / rewriting / blocking)
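The module list above can be sketched as a minimal orchestration loop: input safety screening, agent-driven consultation of the knowledge base and memory, and output safety screening. All class names, placeholder rules, and data structures below are illustrative assumptions for this summary, not definitions from the standard.

```python
from dataclasses import dataclass, field

@dataclass
class DigitalHumanPipeline:
    """Illustrative wiring of the clause-5 support modules (assumed names)."""
    knowledge_base: dict = field(default_factory=dict)   # domain knowledge base
    tools: dict = field(default_factory=dict)            # domain tool library (API-exposed)
    memory: list = field(default_factory=list)           # memory storage repository

    def input_safety(self, request: str) -> bool:
        # Input content safety: screen and block risky requests (placeholder rule).
        return "forbidden" not in request

    def output_safety(self, response: str) -> str:
        # Output content safety: screen/rewrite risky responses (placeholder rule).
        return response.replace("unsafe", "[filtered]")

    def agent_step(self, request: str) -> str:
        # Agent: plan the task, consult the knowledge base, record memory.
        if not self.input_safety(request):
            return "[blocked]"
        fact = self.knowledge_base.get(request, "no domain fact")
        self.memory.append(request)
        return self.output_safety(f"answer({request}) using {fact}")

pipeline = DigitalHumanPipeline(knowledge_base={"greet": "persona profile"})
print(pipeline.agent_step("greet"))
```

A real system would replace the placeholder rules with dedicated safety models and route tool-library calls through the agent's planner; the sketch only shows where each module sits in the request path.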
6 Technical Framework for Large-Model-Enhanced Digital Human Systems (基于大模型的数字人系统技术框架)
Modeling stage (建模阶段)
2D structure generation
3D structure generation (text/image prompt to 3D mesh; includes human, attached assets, scene items)
Texture map generation
Scene composition
Scene image generation (HDRi / Skybox)
Driving stage (驱动阶段)
Lip-sync generation (speech/text to mouth animation)
Gesture generation (content to speaking gestures)
Commanded action generation (instructions to actions)
Expression and behavior generation (structured emotion/behavior tags produced during text generation, mapped to facial and action outputs)
Interaction stage (交互阶段)
Text dialogue
Voice interaction (ASR/TTS)
Visual interaction (pose/gesture to intent and response control)
Multimodal interaction
7 Technical Requirements (技术要求)
7.1 Modeling (建模)
7.1.1 2D Structure Generation (2D结构生成)
Text-prompt generation of 2D character imagery; prompt elements include background, clothing, hairstyle, accessories, gender; MOS target ≥ 4.0
Editing of generated 2D assets (background replacement, outpainting/expansion; advanced edits such as adding/removing accessories, changing clothing)
Optional style transfer against selected style templates
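The MOS target above is a mean opinion score on the usual 1-to-5 rating scale. A minimal sketch of how such a target could be checked against rater scores (the helper names are assumptions, not part of the standard):

```python
def mos(scores):
    """Mean opinion score: average of rater scores on a 1-5 scale."""
    if not scores:
        raise ValueError("MOS needs at least one rater score")
    return sum(scores) / len(scores)

def meets_target(scores, target=4.0):
    # Clause 7.1.1 sets MOS >= 4.0 for generated 2D character imagery.
    return mos(scores) >= target

print(meets_target([4, 5, 4, 4, 3]))  # mean is 4.0 -> True
```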
7.1.2 3D Structure Generation (3D结构生成)
End-to-end 3D mesh generation via text-to-3D and image-to-3D models; includes character, attached assets, and scene items
Geometry/detail targets (face/torso/clothing), including minimum polygon/vertex and rig detail thresholds; specified geometric error bound
Text-to-3D quality targets, including render-FID and CLIP image-text similarity thresholds
Optional NeRF-based 3D scene synthesis
7.1.3 Texture Map Generation (纹理贴图生成)
Texture creation using DCC tools and/or large models; outputs include albedo/diffuse, normal, roughness
PBR material support with minimum texture resolution (≥ 2K)
Optional error detection and auto-correction during texture generation with high accuracy targets
7.1.4 Scene Composition (场景组合)
Intelligent composition capabilities: asset retrieval, element generation, semantic understanding, logical layout
Rapid instruction response; multiple candidate scene compositions (≥ 10) from a scene description; MOS aesthetics target ≥ 4.0
Optional multi-turn instruction refinement for layout optimization
Import support and/or template library; templates include realistic background/props plus interaction logic aligned to workflow
7.1.5 Scene Image Generation (场景图像生成)
Generation of 2D unfolded imagery for 3D scenes (HDRi / 360 Skybox)
High-resolution targets (at least 8K) and realistic lighting requirements
7.2 Driving (驱动)
7.2.1 Lip-Sync Generation (口唇同步生成)
Multilingual and multi-dialect lip-sync; lip/voice alignment accuracy > 90%
Optional adaptive mouth-parameter adjustment for prosody/emotion changes with accuracy targets
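The >90% lip/voice alignment requirement can be read as a frame-level match rate between the generated mouth track and a reference. A minimal sketch, assuming visemes are compared frame by frame (the metric formulation here is illustrative; the standard's exact measurement procedure may differ):

```python
def alignment_accuracy(pred_visemes, ref_visemes):
    """Fraction of frames whose predicted viseme matches the reference track."""
    assert len(pred_visemes) == len(ref_visemes), "tracks must be equal length"
    matches = sum(p == r for p, r in zip(pred_visemes, ref_visemes))
    return matches / len(ref_visemes)

acc = alignment_accuracy("AABBCCAA", "AABBCCAB")
print(acc)  # 7/8 frames match = 0.875, below the > 0.90 requirement
```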
7.2.2 Gesture Action Generation (手势动作生成)
Natural speaking gestures (≥ 10 types)
Video-template-based gesture generation for 2D and animation-based generation for 3D; MOS naturalness target ≥ 4.0
Optional text-to-gesture via intent recognition and prebuilt gesture libraries; MOS naturalness target ≥ 4.0
Optional music-to-dance generation with beat/rhythm matching targets
7.2.3 Commanded Action Generation (指令动作生成)
Diverse action generation (stand, sit, walk, point, etc.); command response time < 200 ms; execution accuracy > 98%
Optional interaction with scene/objects based on commands
Optional complex-instruction parsing into coherent action sequences
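The two numeric targets above (response time < 200 ms, execution accuracy > 98%) suggest a simple measurement harness. The executor below is a hypothetical stand-in used only to show where latency would be timed:

```python
import time

# Hypothetical action vocabulary for illustration only.
ACTIONS = {"stand", "sit", "walk", "point"}

def execute_command(command: str) -> tuple[str, float]:
    """Return the executed action and the command response time in ms."""
    start = time.perf_counter()
    action = command if command in ACTIONS else "idle"  # unknown -> fallback
    latency_ms = (time.perf_counter() - start) * 1000.0
    return action, latency_ms

action, latency = execute_command("walk")
print(action, latency < 200.0)  # clause 7.2.3: response time < 200 ms
```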
7.2.4 Expression and Behavior Generation (表情和行为生成)
Generation of emotion/expression and action tags from the narrative output; mapping of tags to facial expressions and actions
Minimum supported tag counts (emotion and behavior) and high tag-to-output matching accuracy targets
Optional automatic syncing/expansion of tag databases
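The tag-to-output mapping described above can be sketched as a lookup from structured tags embedded in the language model's text to facial parameters and action clips. The concrete tag names and blendshape values below are hypothetical; the standard specifies minimum tag counts and matching accuracy, not this table:

```python
# Hypothetical tag tables (assumed for illustration).
EMOTION_TAGS = {
    "<happy>": {"mouth_smile": 0.8, "brow_raise": 0.3},
    "<sad>":   {"mouth_frown": 0.6, "brow_lower": 0.4},
}
BEHAVIOR_TAGS = {"<wave>": "wave_right_hand", "<nod>": "nod_head"}

def map_tags(tagged_text: str):
    """Collect facial parameters and actions from tags in the narrative output."""
    face, actions = {}, []
    for tag, params in EMOTION_TAGS.items():
        if tag in tagged_text:
            face.update(params)
    for tag, action in BEHAVIOR_TAGS.items():
        if tag in tagged_text:
            actions.append(action)
    return face, actions

print(map_tags("<happy> Hello! <wave>"))
```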
7.3 Interaction (交互)
7.3.1 Text Dialogue (文本对话)
Multi-scenario dialogue: domain QA with knowledge bases, open-domain QA, multi-turn complex dialogue, persona consistency, sensitive-content detection/filtering, fast response
Baseline quality thresholds for accuracy/relevance/naturalness and first-token latency limit (≤ 3 s)
NLU requirements: intent recognition, context management, emotion recognition/response, knowledge QA with defined accuracy thresholds
Optional personalization, proactive learning, stronger persona/personality consistency, external information retrieval for up-to-date data
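The first-token latency limit (<= 3 s) is naturally measured against a streaming dialogue interface: the clock runs from request submission to the first streamed token. A minimal sketch with a stand-in stream (the generator here is illustrative, not a real model API):

```python
import time

def fake_stream(tokens, delay_s=0.0):
    """Stand-in for a streaming dialogue model (illustrative only)."""
    for tok in tokens:
        time.sleep(delay_s)
        yield tok

def first_token_latency_s(stream):
    """Seconds from request to the first streamed token (clause 7.3.1 limit: <= 3 s)."""
    start = time.perf_counter()
    first = next(stream)
    return first, time.perf_counter() - start

tok, latency = first_token_latency_s(fake_stream(["Hel", "lo"]))
print(tok, latency <= 3.0)
```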
7.3.2 Voice Interaction (语音交互)
Multilingual speech recognition, cloning, translation, synthesis; at minimum Mandarin Chinese and English
ASR accuracy and TTS naturalness MOS targets; robustness under noise
Context-aware correction to improve semantic understanding with defined accuracy targets
Small-sample voice cloning constraints (limited training data and time) with MOS targets
Optional selectable timbres by gender/age/role; multilingual timbre synthesis and one-click translated digital-human video
Optional open-domain ASR quality, low-latency cloning constraints, accent detection, cross-lingual speech-to-speech translation preserving voice/prosody/emotion, emotion-state inference, speaker verification, emotional TTS
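ASR accuracy for Mandarin is commonly scored as character error rate (CER): Levenshtein edit distance between hypothesis and reference, divided by reference length. The standard's exact scoring protocol is not reproduced here; this is the conventional metric as a sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance over reference length."""
    m, n = len(reference), len(hypothesis)
    dist = list(range(n + 1))  # rolling DP row of edit distances
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            # deletion, insertion, substitution/match
            prev, dist[j] = dist[j], min(dist[j] + 1, dist[j - 1] + 1, prev + cost)
    return dist[n] / m if m else 0.0

print(cer("你好世界", "你好时节"))  # 2 substitutions / 4 chars = 0.5
```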
7.3.3 Visual Interaction (视觉交互)
Pose/gesture (and facial) recognition with end-to-end latency and success-rate targets
Optional face-based identity recognition with accuracy targets
Optional natural interaction metrics: intent recognition accuracy, eye-contact rate, satisfaction targets
Optional multi-person scene target-user identification and gaze tracking
7.3.4 Multimodal Interaction (多模态交互)
Cross-modal fusion across text, speech, vision; multimodal command parsing accuracy target ≥ 90%
Digital human expressive output: facial expressions, actions, gestures
Optional continual learning from multimodal interaction data
Optional structured “card” presentation of key info while the digital human speaks (data, options, illustrative figures)
Optional dialogue filler behavior during waits and robust interruption handling
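The >= 90% multimodal command parsing target reduces to a match rate between fused intents and reference labels over an evaluation batch. A minimal sketch with a hypothetical batch (the intent labels are invented for illustration):

```python
def parsing_accuracy(parsed_intents, reference_intents):
    """Share of commands whose fused intent matches the reference label
    (clause 7.3.4 target: >= 90%)."""
    assert len(parsed_intents) == len(reference_intents)
    correct = sum(p == r for p, r in zip(parsed_intents, reference_intents))
    return correct / len(reference_intents)

# Hypothetical batch: each command fuses text, speech, and vision cues.
parsed = ["open_menu", "wave_back", "show_card", "stop"]
reference = ["open_menu", "wave_back", "show_card", "pause"]
print(parsing_accuracy(parsed, reference))  # 3/4 = 0.75, below the 0.90 target
```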
References (参考文献)