The National Radio and Television Administration (NRTA) of China has released an industry standard draft, "Technical Requirements for Digital Virtual Humans," applicable to the broadcast, television, and online audiovisual sectors. The document specifies the classification, application scenarios, appearance, driving technology, platform capabilities, and security of digital virtual humans. Developed under industry standard-setting procedures, the draft has been reviewed by the National Radio, Film, and Television Standardization Technical Committee and is now open for public consultation. Objections may be submitted during the public review period (November 15–24, 2024), accompanied by supporting evidence, the submitter's identification, workplace details, and contact information.
Overall Architecture
Digital humans are classified by representation (2D or 3D), interaction type (interactive or non-interactive), and driving method (algorithm-driven or real-person-driven). Application scenarios include content broadcasting (e.g., news, sign language, live-streaming), interactive customer service (e.g., virtual assistants, Q&A systems), virtual performances (e.g., concerts, user avatars), and content creation (e.g., film, video, advertising, games). The technical framework integrates several components: visual appearance (realistic or stylized), algorithm-driven capabilities (text, speech, and video inputs), real-person-driven functions (motion and facial expression capture), platform functionalities (creation, maintenance, deployment), and security measures (data and privacy protection). The architecture is intended to remain adaptable and scalable across diverse uses in the audiovisual sector.
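The three classification axes above can be modeled as a simple taxonomy. The sketch below is illustrative only; the type and field names are my own, not terms defined in the draft standard:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical taxonomy mirroring the draft's three classification axes;
# names are illustrative, not taken from the standard text.
class Representation(Enum):
    TWO_D = "2D"
    THREE_D = "3D"

class Interaction(Enum):
    INTERACTIVE = "interactive"
    NON_INTERACTIVE = "non-interactive"

class Driving(Enum):
    ALGORITHM = "algorithm-driven"
    REAL_PERSON = "real-person-driven"

@dataclass(frozen=True)
class DigitalHumanProfile:
    representation: Representation
    interaction: Interaction
    driving: Driving
    scenario: str  # e.g. "news broadcasting", "customer service"

# Example: a 3D, interactive, algorithm-driven customer-service assistant.
assistant = DigitalHumanProfile(
    Representation.THREE_D,
    Interaction.INTERACTIVE,
    Driving.ALGORITHM,
    "customer service",
)
```

Pinning these axes down as enumerations makes it easy to validate that a given deployment declares one value per axis before the other requirement categories (appearance, driving, security) are checked.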
Appearance Requirements
Appearance requirements for digital humans focus on visual realism, technical accuracy, and aesthetic appropriateness. General guidelines include ensuring the character aligns with its role and scenario, maintaining smooth animations, and preventing technical issues such as distortion or delays. 2D representations are divided into realistic styles (accurately replicating human features) and cartoon styles (dynamic and lively). 3D models must meet high standards for realism or stylization, supporting detailed textures (e.g., skin, hair, facial features) and dynamic lighting effects. Models require seamless topology and compatibility with different tools, enabling accurate physical simulation for accessories such as hair and clothing. Together these requirements ensure that digital humans are visually cohesive, engaging, and technically functional.
Algorithm-Driven Capabilities
Algorithm-driven capabilities enable digital humans to respond and interact through various inputs, including text, speech, and video. Text-driven capabilities rely on speech-synthesis models such as DurIAN and the HiFi-GAN vocoder, combining emotional and contextual understanding to generate synchronized voice, gestures, and expressions. Speech-driven capabilities include real-time detection and elimination of background noise, dynamic voice activity detection, and multi-emotion voice synthesis. Video-driven capabilities leverage computer vision to map facial expressions and body movements captured in real time or offline. Multi-modal synchronization ensures accurate alignment of expressions, gestures, and voice with user inputs, enabling natural and lifelike interactions. The framework also supports customizable synthesis for specific scenarios such as broadcasting or customer service.
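Of the speech-driven functions listed, voice activity detection is the simplest to illustrate. The sketch below is a minimal energy-based detector; production VAD typically uses trained models, and the threshold here is an arbitrary illustrative value:

```python
import math

# Minimal energy-based voice activity detector (VAD) sketch.
# Real systems use trained models; the 0.05 threshold is illustrative.
def frame_energy(samples: list[float]) -> float:
    """Root-mean-square energy of one audio frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_speech(frames: list[list[float]], threshold: float = 0.05) -> list[bool]:
    """Mark each audio frame as speech (True) or silence (False)."""
    return [frame_energy(f) > threshold for f in frames]

# One silent frame and one 440 Hz tone frame at 16 kHz, 10 ms each.
silence = [0.0] * 160
tone = [0.3 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
flags = detect_speech([silence, tone])  # → [False, True]
```

Dynamic VAD, as the draft describes it, would adapt the threshold to the measured noise floor instead of fixing it in advance.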
Real-Person Driving Capabilities
Real-person driving capabilities involve motion capture systems that translate body movements and facial expressions into digital animations. Optical motion capture uses infrared or laser technologies to track reflective markers, while inertial capture employs sensors like accelerometers and gyroscopes to record motion. Visual motion capture uses cameras to track movement based on defined markers or patterns. Facial expression capture integrates sensors or markers to map key features such as muscle movement, wrinkles, and expressions onto a virtual model. Data handling emphasizes real-time compatibility, high accuracy, and seamless animation reproduction. These capabilities ensure digital humans mimic real-life movements and emotions naturally and fluidly.
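The core of inertial capture is integrating gyroscope angular velocity into joint orientation. The sketch below shows only that integration step for a single joint angle; real systems fuse accelerometer and magnetometer data to correct the drift that pure integration accumulates (the function name and sample rate are my own illustration):

```python
# Sketch of inertial motion capture: integrate gyroscope angular
# velocity into a joint angle. Real systems fuse accelerometer and
# magnetometer readings to correct drift; this shows only the core
# integration step for one joint.
def integrate_gyro(angle_deg: float, angular_velocity_dps: float, dt: float) -> float:
    """Advance a joint angle by one gyroscope sample (degrees per second)."""
    return angle_deg + angular_velocity_dps * dt

# 100 Hz samples: an elbow rotating at a constant 90 deg/s for 0.5 s.
angle = 0.0
for _ in range(50):
    angle = integrate_gyro(angle, 90.0, 0.01)
# angle is now ~45 degrees
```

The resulting per-joint angles are what the animation layer retargets onto the virtual model's skeleton each frame.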
Platform Capabilities
Platforms for digital humans must support character creation, content production, and service configuration. They should handle tasks such as asset management, customization of appearance and voice, and interactive functions like gesture and conversation management. Platforms can be deployed on public or private clouds or on local infrastructure, integrating with various devices (PCs, mobile devices, large screens) and formats (web, apps, or mini-programs). Tools for synthesizing and driving digital humans include multi-modal AI algorithms, real-time video streaming, and customization options for voice, attire, and animations. Platforms must also support integration with third-party digital assets and provide scalable deployment using microservices and distributed databases.
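A microservice deployment along these lines might be described by a configuration like the following. This is a hypothetical descriptor; the service names, protocols, and fields are illustrative, not mandated by the draft:

```python
# Hypothetical deployment descriptor for a digital-human platform;
# service names and fields are illustrative, not from the standard.
DEPLOYMENT = {
    "deployment_target": "private-cloud",  # or "public-cloud", "on-premises"
    "clients": ["web", "app", "mini-program"],
    "services": {
        "avatar-renderer": {"replicas": 3, "protocol": "webrtc"},
        "dialogue-engine": {"replicas": 2, "protocol": "grpc"},
        "asset-store": {"replicas": 1, "protocol": "https"},
    },
}

def scale_service(config: dict, name: str, replicas: int) -> dict:
    """Return a copy of the config with one microservice rescaled."""
    updated = {**config, "services": dict(config["services"])}
    updated["services"][name] = {**config["services"][name], "replicas": replicas}
    return updated

scaled = scale_service(DEPLOYMENT, "dialogue-engine", 4)
```

Returning a modified copy rather than mutating in place keeps the original descriptor available for rollback, which matters when scaling is driven by automated load policies.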
Security Requirements
Security measures for digital humans focus on protecting data, algorithms, and user privacy. Data must be collected and used within legal and regulatory bounds, with encryption and access controls safeguarding sensitive information during storage and transmission. Algorithms require secure implementation to prevent misuse or manipulation, with restrictions on creating false or harmful content. Personal information protection mandates transparency and user consent, especially for biometric data like facial scans or voice samples. Mechanisms must be in place to ensure ethical and lawful handling of sensitive data, aligning with privacy regulations and fostering user trust.
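One concrete pattern for protecting biometric identifiers at rest is keyed pseudonymization before storage. The sketch below uses an HMAC for illustration; a production system should use vetted encryption (e.g., AES-GCM via a maintained cryptography library), a real key-management service, and must still satisfy applicable personal-information-protection law:

```python
import hashlib
import hmac
import secrets

# Sketch: pseudonymize a biometric identifier before storage.
# HMAC-SHA256 is shown for illustration only; production systems
# should use vetted encryption and proper key management.
SECRET_KEY = secrets.token_bytes(32)  # would live in a key-management system

def pseudonymize(biometric_id: str, key: bytes = SECRET_KEY) -> str:
    """Replace a raw identifier with a keyed, irreversible token."""
    return hmac.new(key, biometric_id.encode(), hashlib.sha256).hexdigest()

token = pseudonymize("voiceprint:user-42")
```

Because the token is keyed, the same identifier always maps to the same token within one system (so records can still be joined), yet the raw voiceprint label never appears in storage and cannot be recovered without the key.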