ASAP

Auto-generating Storyboard And Previz 

Hanseob Kim      Ghazanfar Ali      Bin Han     Hwangyoun Kim    Jieun Kim      Jae-In Hwang

Korea Institute of Science and Technology

Abstract

We present ASAP, a system that uses virtual humans to Automatically generate Storyboards And Pre-visualized scenes from movie scripts. In ASAP, virtual humans play the role of actors. To visualize a screenplay scene, our system understands the movie script, i.e., its text data, and then automatically generates the following verbal and non-verbal behaviors of the virtual humans: (1) co-speech gestures, (2) facial expressions, and (3) body movements. From dialogue paragraphs, co-speech gestures are generated through a text-to-gesture model trained with 2D videos and 3D motion-captured data. Next, for facial expressions, we interpret the actors' emotions in the parenthetical paragraphs and then adjust the virtual human's facial animation to reflect emotions such as anger and sadness. For body movements, our system extracts action entities (e.g., subject, target, and action) from action paragraphs and then combines sets of animations to create animation sequences (e.g., a man's act of sitting on a bed). As soon as possible, ASAP can reduce the time, money, and labor-intensive work required in the early stages of filmmaking.

Figure 1, Caption: A screenshot of our tool ASAP's graphical user interface. When a user uploads a script written with Final Draft software, our system parses the script to simulate visual scenes with virtual humans. The simulated scenes are displayed in the multi-camera view, and the camera's position/orientation is controlled with the keyboard and mouse. The buttons at the bottom-right corner of the screen allow the user to observe automatically generated 3D animated scenes, or pre-visualizations (previz), for each paragraph or the entire script. In addition, the user can capture the currently playing scenes and export them as a storyboard.

Technical background

Virtual environments (VEs) are an excellent place to "automatically" simulate visual scenes. Moreover, digital humans (DHs) are human-like characters that can replace human actors in VEs, and even in real environments. Thus, we present ASAP, a practical system that automatically generates DHs' behaviors by understanding text data in order to simulate 3D visual scenes instantly.

At our system's core are two modules: gesture generation and physical action generation. The gesture generation module is an automatically generated rule-based system. We generated text-to-animation rules by extracting transcripts and 2D poses from a large corpus of public videos using speech recognition and pose estimation, respectively. The extracted poses are usually noisy and not directly usable for 3D DHs, so we trained a contrastive learning model to map 2D poses to 3D gestures accurately, and we used motion capture to obtain functional 3D gestures.
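To make the 2D-pose-to-3D-gesture matching concrete, here is a minimal sketch of how a contrastive (InfoNCE-style) model could embed noisy 2D pose sequences and motion-captured 3D gesture units into a shared space for matching; the encoder architecture, joint counts, and dimensions are illustrative assumptions, not the exact model used in ASAP.

# Hypothetical sketch: contrastive matching of noisy 2D pose sequences to 3D
# gesture units in a shared embedding space (architecture and sizes are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqEncoder(nn.Module):
    """Encodes a (frames x joint-coordinates) sequence into one unit-norm embedding."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.gru = nn.GRU(in_dim, emb_dim, batch_first=True)
        self.proj = nn.Linear(emb_dim, emb_dim)

    def forward(self, x):                      # x: (batch, frames, in_dim)
        _, h = self.gru(x)                     # h: (layers, batch, emb_dim)
        return F.normalize(self.proj(h[-1]), dim=-1)

pose2d_enc = SeqEncoder(in_dim=2 * 17)         # e.g., 17 joints in 2D
gest3d_enc = SeqEncoder(in_dim=3 * 21)         # e.g., 21 joints in 3D

def info_nce_loss(z_pose, z_gesture, temperature=0.07):
    """Paired 2D-pose / 3D-gesture embeddings attract; unpaired ones repel."""
    logits = z_pose @ z_gesture.t() / temperature
    targets = torch.arange(z_pose.size(0))
    return F.cross_entropy(logits, targets)

# Toy training step with random tensors standing in for real paired clips.
poses = torch.randn(8, 60, 2 * 17)             # 8 clips, 60 frames each
gestures = torch.randn(8, 60, 3 * 21)
loss = info_nce_loss(pose2d_enc(poses), gest3d_enc(gestures))
loss.backward()

At runtime, a noisy 2D pose sequence can then be assigned to the 3D gesture unit with the highest cosine similarity in this shared space.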

The physical action module uses a BERT-based model that extracts action entities (e.g., subject, object, and action) from the given text. We prepared a custom dataset of sentences, ranging from simple to complex, paired with their corresponding entities, and trained the model on it. Using the extracted action entities, we developed a synthesizer that generates sequential animations to simulate the DH's physical actions in the VE.
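As a rough illustration of such an extractor, the sketch below tags tokens with subject/action/target labels using a Hugging Face token-classification head; the label set and the bert-base-cased checkpoint are assumptions, and a usable model would first be fine-tuned on the custom dataset described above.

# Hypothetical sketch of BERT-based action-entity extraction. The label set and
# base checkpoint are assumptions; without fine-tuning the predictions are random.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-SUBJECT", "I-SUBJECT", "B-ACTION", "I-ACTION", "B-TARGET", "I-TARGET"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS))   # would be fine-tuned in practice

def extract_entities(sentence):
    """Tag each token with an entity label (subject / action / target / other)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits              # (1, tokens, num_labels)
    ids = logits.argmax(-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, [LABELS[i] for i in ids]))

print(extract_entities("The man turns on the light."))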

The pipeline of the ASAP system consists of three steps. First, the user uploads a script in the widely used Final Draft (XML) format. Our system then parses it into Action, Character, Parenthetical, and Dialogue paragraphs to understand and simulate the story in the VE with DHs. Second, action entities are extracted from the action paragraphs, and animations are synthesized. Last, co-speech gestures and facial emotions are generated from the dialogue paragraphs, and the DH's body actions are generated by combining the extracted entities with a set of simple animations and virtual objects. The tool's output is thus a complete 3D animated/visual scene.
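For the parsing step, a minimal sketch of splitting a Final Draft (.fdx) script into its paragraph types with standard XML parsing might look as follows; the element and attribute names follow the common .fdx layout but are treated here as assumptions, and production code would need more robust handling.

# Minimal sketch of parsing a Final Draft (.fdx) script into paragraph types.
# Element/attribute names follow the common .fdx layout (an assumption here).
import xml.etree.ElementTree as ET
from collections import defaultdict

def parse_fdx(path):
    """Group paragraph texts by their Type attribute (Action, Character, ...)."""
    tree = ET.parse(path)
    paragraphs = defaultdict(list)
    for para in tree.getroot().iter("Paragraph"):
        ptype = para.get("Type", "Unknown")
        text = "".join(t.text or "" for t in para.iter("Text")).strip()
        if text:
            paragraphs[ptype].append(text)
    return paragraphs

# script = parse_fdx("scene.fdx")
# script["Action"], script["Character"], script["Parenthetical"], script["Dialogue"]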

Overview of the System Process

Figure 2, Caption:

(a) The uploaded script, written in Final Draft, is parsed into Action, Character, Parenthetical, and Dialogue paragraphs by our system.

(b) The Action paragraph is sent to the entity extractor to prepare for the creation of physical actions. 

(c) The other paragraphs are sent to the animation synthesizer, which generates co-speech gestures together with the voice synthesizer. Then, using the action entities passed from the extractor, our system combines simple animation sets with virtual object sets to generate the character's physical (or body) animations.

(d) Lastly, users can observe the pre-visualized (or previz) animation and create a storyboard by capturing the simulated scenes.

Gesture Generation 

This module takes text as input and generates a corresponding animation sequence using text-to-gesture mapping. The process is detailed in Figure 3. To generate co-speech text-to-gesture mappings for 3D digital humans, we obtained text and 2D pose data from public monologue videos, while gesture units were obtained from motion capture sequences. The method works by matching 2D poses to 3D gesture units. We trained a model via contrastive learning to improve the matching of noisy pose sequences with gesture units. To ensure diverse gesture sequences at runtime, the gesture units were clustered using K-means clustering.

Figure 3, Caption: System overview. (A) Gesture unit extraction and gesture clustering. (B) 2D pose-text pair extraction. (C) Rule-map generation by matching poses and gesture units with GestureCLR.
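The clustering used for runtime diversity could look roughly like the sketch below, which groups flattened gesture-unit features with K-means and avoids reusing the previous cluster when selecting the next unit; the feature layout, cluster count, and selection heuristic are assumptions.

# Hypothetical sketch: cluster 3D gesture units with K-means so that consecutive
# gestures are drawn from different clusters (sizes and heuristic are assumptions).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 500 gesture units, each 60 frames x 21 joints x 3 coordinates, flattened.
gesture_units = rng.normal(size=(500, 60 * 21 * 3))

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(gesture_units)

def pick_diverse(prev_cluster=None):
    """Pick a gesture-unit index whose cluster differs from the previous one."""
    if prev_cluster is None:
        candidates = np.arange(len(gesture_units))
    else:
        candidates = np.flatnonzero(kmeans.labels_ != prev_cluster)
    return int(rng.choice(candidates))

first = pick_diverse()
second = pick_diverse(prev_cluster=int(kmeans.labels_[first]))  # different cluster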

Physical Action

This module uses the action entities extracted from the text to generate the DH's physical actions (or body movements). For this, we prepared simple character animation clips from Mixamo and an inverse kinematics animation tool in advance. We also store the features of the interactable objects (e.g., type, name, position, and direction) in the prepared VE. Then, using a rule-based approach that takes the extracted action entities as parameters, we combine animation sets with virtual objects to synthesize animation sequences as the DH's physical actions. The rule has two main parts: 1) approaching the target, and 2) acting with the target. For example, Figure 4 (e) corresponds to 'The man turns on the light'. Here, the man (i.e., the DH) first needs to move toward the target object. To position the DH at the target point, the first rule exploits the target object's information (e.g., position, direction) and connects locomotion clips (e.g., walk, run, rotation). When the DH arrives near the target, the second rule executes an action clip (e.g., pushing the button) that matches the type of target and the action. Thus, users can observe serialized animations that correspond to their scripts (i.e., the author's intentions).

Figure 4, Caption: Automatically generated digital human behaviors according to the user's contexts: (a) initial state, (b) lie on a bed, (c) bring a pillow, (d) open the drawer, (e) turn on a lamp, (f) open the window, (g) turn off the light, and (h) sit on a chair.
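As a simplified illustration of this two-part rule, the sketch below first chooses locomotion clips from the DH's distance to the target and then an action clip keyed by the target's type and the action; all object data, clip names, and thresholds here are made up for illustration.

# Hypothetical sketch of the two-part rule: (1) approach the target object,
# (2) play an action clip matched to the target type and the action.
import math

OBJECTS = {"lamp": {"position": (2.0, 0.0, 1.5), "type": "switchable"}}   # illustrative
ACTION_CLIPS = {("switchable", "turn on"): "push_button",
                ("sittable", "sit"): "sit_down"}

def synthesize(dh_pos, action, target, reach=0.5):
    """Return the ordered animation clips for one extracted action entity."""
    obj = OBJECTS[target]
    clips = []
    # Rule 1: locomotion toward the target until the DH is within reach.
    dist = math.dist(dh_pos, obj["position"])
    if dist > reach:
        clips += ["rotate_toward_target", "walk" if dist < 5.0 else "run"]
    # Rule 2: act with the target, picking a clip that matches type + action.
    clips.append(ACTION_CLIPS.get((obj["type"], action), "idle"))
    return clips

print(synthesize((0.0, 0.0, 0.0), "turn on", "lamp"))
# ['rotate_toward_target', 'walk', 'push_button']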

Example of Action Entities 

These entities are used only in this demo video.

Facial animations

Our system provides a function to adjust the DH's facial expressions. The system contains seven main emotion animations (Angry, Disgusted, Fear, Happy, Sad, Surprised, and Neutral). Additionally, users can adjust the emotional intensity across three levels (strong, medium, and low). Screenwriters can designate emotions in the Parenthetical paragraphs of the script. The system analyzes each Parenthetical paragraph and extracts the emotion type and intensity information in a rule-based manner. Consequently, screenwriters can use this module to easily control the characters' facial expressions and emotions.
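A minimal sketch of such a rule-based Parenthetical parser is shown below; the emotion keywords mirror the seven emotions above, while the intensity cues and defaults are assumptions.

# Hypothetical rule-based parser for Parenthetical text such as "(very angry)".
EMOTIONS = {"angry", "disgusted", "fear", "happy", "sad", "surprised", "neutral"}
INTENSITY_CUES = {"very": "strong", "extremely": "strong",
                  "slightly": "low", "a little": "low"}   # assumed cue words

def parse_parenthetical(text):
    """Return (emotion, intensity) extracted from one Parenthetical paragraph."""
    words = text.strip("() ").lower()
    emotion = next((e for e in EMOTIONS if e in words), "neutral")
    intensity = next((level for cue, level in INTENSITY_CUES.items() if cue in words),
                     "medium")
    return emotion, intensity

print(parse_parenthetical("(very angry)"))      # ('angry', 'strong')
print(parse_parenthetical("(slightly sad)"))    # ('sad', 'low')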

Facial expression examples: Angry, Disgusted, Fear, Happy, Sad, and Surprised (the Sad and Angry animations are shown in the demo).

Our Previous Related Works