Facial animation is an important part of a character's performance, but it is also an expensive and time-consuming feature to implement.
It has become viable to capture faces as part of full performance capture, and it is giving excellent results in many games.
However, in games with conversation systems and a large amount of dialogue this is not quite a solved problem, and we relied on procedural systems that analyze the audio and text to generate facial animations.
On an early game we invested in a facial animation system engineered by a programmer on our team. I can't speak to it in much detail, but it performed comparably to other systems at the time.
What I've used since then is mostly middleware called FaceFX. Choosing middleware for facial animation is generally a cost and time question. If you can afford to capture or hand-animate all the facial animation you need, that will give you the best quality.
Middleware comes in when you need a solution that scales. However, in my experience the best results from middleware don't come straight out of the box. You need to do more, and this is something I found other teams did as well.
What we did was split the work: for the important cinematics you need an animator's touch to get the best quality; for conversations, the FaceFX generation was used together with some annotations in the lines of dialogue; and for important conversations where there was no time for the animators, we exported the FaceFX-generated animations and polished them.
First of all, to use middleware you need to learn how to use it and what the best approach is.
FaceFX generates weight curves for phonemes, and the phonemes are mapped to joints or blend shapes. We initially made the mistake of creating a very complex mapping table that baked keys for each individual shape the phonemes mapped to. This created two problems: very large animations, since we had tracks for every shape rather than for every phoneme, and a lot of facial jitter, since the shapes didn't blend together as well as they could.
Fixing that addressed the size of the animations and solved some of the jitter.
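To make the difference concrete, here is a minimal sketch of the per-phoneme approach: the animation stores one weight curve per phoneme and the phoneme-to-shape mapping is applied at evaluation time, so no key track is baked per blend shape. All of the types and names are illustrative, not the actual FaceFX API.

```cpp
// Illustrative sketch: store one curve per phoneme and apply the
// phoneme-to-shape mapping when the frame is evaluated, instead of baking a
// key track for every individual blend shape.
#include <map>
#include <string>
#include <vector>

struct Key { float time; float value; };

struct Curve {
    std::vector<Key> keys;
    // Linear evaluation for brevity; real curves would use Hermite tangents.
    float Evaluate(float t) const {
        if (keys.empty()) return 0.0f;
        if (t <= keys.front().time) return keys.front().value;
        if (t >= keys.back().time) return keys.back().value;
        for (size_t i = 1; i < keys.size(); ++i) {
            if (t < keys[i].time) {
                const Key& a = keys[i - 1];
                const Key& b = keys[i];
                const float alpha = (t - a.time) / (b.time - a.time);
                return a.value + alpha * (b.value - a.value);
            }
        }
        return keys.back().value;
    }
};

// One weighted contribution of a phoneme to a blend shape (or facial joint).
struct ShapeContribution { std::string shape; float weight; };

// The mapping table lives outside the animation data, so each animation only
// stores one curve per phoneme: far fewer tracks and far fewer keys.
using PhonemeMapping = std::map<std::string, std::vector<ShapeContribution>>;

std::map<std::string, float> EvaluateFrame(
    const std::map<std::string, Curve>& phonemeCurves,
    const PhonemeMapping& mapping,
    float time)
{
    std::map<std::string, float> shapeWeights;
    for (const auto& [phoneme, curve] : phonemeCurves) {
        const float w = curve.Evaluate(time);
        const auto it = mapping.find(phoneme);
        if (it == mapping.end()) continue;
        for (const ShapeContribution& c : it->second)
            shapeWeights[c.shape] += w * c.weight;  // shapes blend smoothly
    }
    return shapeWeights;  // feed these to the blend shapes / facial joints
}
```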
There was an additional problem with the version of FaceFX we used: many shapes fight each other when they are keyed next to each other. What you can do about that is replace the two keys with a single combined shape. This solved even more of the jitter.
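Here is one way to picture the combined-shape idea, resolved at evaluation time over the blended weights; in practice it can just as well be done offline over the keys themselves. The conflict table and shape names are hypothetical and would be authored for the specific rig.

```cpp
// Illustrative sketch: fold pairs of conflicting shapes into a single authored
// combined shape so they no longer fight. The pair table is hypothetical and
// would be authored per rig.
#include <algorithm>
#include <map>
#include <string>
#include <utility>

static const std::map<std::pair<std::string, std::string>, std::string>
    kCombinedShapes = {
        {{"MouthNarrow", "MouthWide"},   "MouthNarrowWide"},
        {{"JawOpen",     "LipsPressed"}, "JawOpenLipsPressed"},
    };

// Given the evaluated shape weights for a frame, move the overlapping part of
// each conflicting pair onto its combined shape.
void ResolveConflicts(std::map<std::string, float>& shapeWeights)
{
    for (const auto& [pair, combined] : kCombinedShapes) {
        const auto a = shapeWeights.find(pair.first);
        const auto b = shapeWeights.find(pair.second);
        if (a == shapeWeights.end() || b == shapeWeights.end()) continue;
        if (a->second <= 0.0f || b->second <= 0.0f) continue;
        const float overlap = std::min(a->second, b->second);
        shapeWeights[combined] += overlap;
        a->second -= overlap;
        b->second -= overlap;
    }
}
```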
As I mentioned before, we also added support for injecting facial animations into the generated animation via tags. FaceFX came with support for injecting emotion shape tags, but this gave very limited quality.
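As an illustration of what such tagging can look like, here is a sketch of a small annotation format embedded in the dialogue text and a parser that turns it into timed emotion events. The tag syntax is invented for this example; it is not the FaceFX format.

```cpp
// Illustrative tag format embedded in a dialogue line, parsed into timed
// emotion events. The syntax is invented for this example.
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

struct EmotionEvent { float time; std::string emotion; float intensity; };

// Example line: "I told you to stay put. [emote t=1.2 name=Angry w=0.7]"
std::vector<EmotionEvent> ParseEmotionTags(const std::string& line)
{
    std::vector<EmotionEvent> events;
    static const std::regex tag(
        R"(\[emote\s+t=([0-9.]+)\s+name=(\w+)\s+w=([0-9.]+)\])");
    for (auto it = std::sregex_iterator(line.begin(), line.end(), tag);
         it != std::sregex_iterator(); ++it) {
        events.push_back({ std::stof((*it)[1].str()),
                           (*it)[2].str(),
                           std::stof((*it)[3].str()) });
    }
    return events;
}

int main()
{
    const auto events = ParseEmotionTags(
        "I told you to stay put. [emote t=1.2 name=Angry w=0.7]");
    for (const auto& e : events)
        std::printf("%.2fs %s %.2f\n", e.time, e.emotion.c_str(), e.intensity);
}
```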
BioShock Infinite and Spec Ops also used FaceFX, and while we shared some of the same solutions, BioShock had an approach to emotions that in hindsight I think was better because of the memory savings: allowing two facial animations to play at the same time, one for the emotion and one for the dialogue animation.
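A rough sketch of what that layering could look like, blending an emotion layer over the generated dialogue animation per shape; this is my reading of the idea, not their actual implementation.

```cpp
// Illustrative layering of an emotion animation on top of the dialogue
// animation, blended per shape.
#include <map>
#include <string>

using ShapeWeights = std::map<std::string, float>;

// Additive blending is one reasonable choice; a rig might instead reserve the
// brows and lids for the emotion layer and the mouth shapes for the dialogue.
ShapeWeights BlendLayers(const ShapeWeights& dialogue,
                         const ShapeWeights& emotion,
                         float emotionWeight)
{
    ShapeWeights out = dialogue;
    for (const auto& [shape, value] : emotion) {
        const float blended = out[shape] + value * emotionWeight;
        out[shape] = blended > 1.0f ? 1.0f : blended;  // clamp to valid range
    }
    return out;
}
```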
Last-gen hardware (the PS3 especially) did not have much memory, which meant we had to cut back on memory usage in every system. While FaceFX is quite good in that it doesn't introduce too many additional keys, one problem was that it used 32-bit floats for the time, value, and in and out tangents of each key. A common technique for compressing animations is fixed-point compression, where you convert a float to a smaller data type. Since we knew the key value and the two tangent values fell within a small range, we could easily compress them (and even drop the tangents entirely if required and approximate them intelligently instead).
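A minimal sketch of that fixed-point scheme, assuming the values are known to stay within a small range (the [-2, 2] range here is just an illustrative choice; key times would be normalized against the clip duration instead):

```cpp
// Illustrative fixed-point compression of key data with a known value range.
#include <cstdint>
#include <cstdio>

constexpr float kMin = -2.0f;  // assumed range for values and tangents
constexpr float kMax =  2.0f;

uint16_t CompressToFixedPoint(float v)
{
    if (v < kMin) v = kMin;
    if (v > kMax) v = kMax;
    const float normalized = (v - kMin) / (kMax - kMin);  // 0..1
    return static_cast<uint16_t>(normalized * 65535.0f + 0.5f);
}

float DecompressFromFixedPoint(uint16_t q)
{
    return kMin + (static_cast<float>(q) / 65535.0f) * (kMax - kMin);
}

// A key shrinks from 16 bytes (four 32-bit floats) to 8 bytes; dropping the
// tangents and approximating them at load time would halve it again.
struct CompressedKey { uint16_t time, value, tanIn, tanOut; };

int main()
{
    const float original = 0.73f;
    const uint16_t packed = CompressToFixedPoint(original);
    std::printf("%.5f -> %u -> %.5f\n",
                original, packed, DecompressFromFixedPoint(packed));
}
```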
We also did additional analysis of the facial animations and removed empty key tracks, which saved further memory.
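The cleanup itself can be as simple as the sketch below: drop any curve that has no keys, or whose keys never move away from zero. The Curve and Key types mirror the earlier illustrative sketch.

```cpp
// Illustrative cleanup pass: remove key tracks that carry no signal.
#include <cmath>
#include <map>
#include <string>
#include <vector>

struct Key { float time; float value; };
struct Curve { std::vector<Key> keys; };

void StripEmptyTracks(std::map<std::string, Curve>& tracks,
                      float epsilon = 1e-4f)
{
    for (auto it = tracks.begin(); it != tracks.end(); ) {
        bool hasSignal = false;
        for (const Key& k : it->second.keys) {
            if (std::fabs(k.value) > epsilon) { hasSignal = true; break; }
        }
        if (hasSignal) ++it;
        else it = tracks.erase(it);  // empty or flat-at-zero track: drop it
    }
}
```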
Then there is a common problem with middleware that is integrated into Unreal, and I assume other engines: the integration usually isn't as high quality as the middleware itself, and we had to make sure to avoid duplicating memory between the two. This was also a significant saving.
The element that puts quality into a generated facial performance is the additional tagging for emotions and eye tracking. I learned this was also key in the BioWare facial animation system at the time, and it was still in use quite recently. However, if you didn't spend the time adding those tags, the quality was greatly reduced.
The second element is showing reactions in the conversation system. It's important to have a few different types of cameras, and there are other people who can explain that better than I can, but cutting to show the reaction to a line of dialogue is something you always see in TV shows for good reason. We didn't do that directly in The Bureau, but I believe Mass Effect had it.
The third part is all in the eyes: you never, ever want to show dead eyes or a blank face. Keep them active even when the character is not delivering a line of dialogue, and keep the eye line. We allowed the full-body animations to control the eye line using a joint as the target.
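A minimal sketch of that kind of eye-line control, assuming the body animation drives a look-at joint and the face rig exposes yaw/pitch aim controls for the eyes; the math and names are illustrative only.

```cpp
// Illustrative eye-line computation: aim the eyes at a target joint that the
// full-body animation moves around.
#include <cmath>

struct Vec3 { float x, y, z; };
struct EyeAim { float yawDegrees; float pitchDegrees; };

EyeAim ComputeEyeAim(const Vec3& eyePosition, const Vec3& targetJointPosition)
{
    const Vec3 d = { targetJointPosition.x - eyePosition.x,
                     targetJointPosition.y - eyePosition.y,
                     targetJointPosition.z - eyePosition.z };
    const float flatLength = std::sqrt(d.x * d.x + d.z * d.z);
    constexpr float kRadToDeg = 57.29578f;
    EyeAim aim;
    aim.yawDegrees   = std::atan2(d.x, d.z) * kRadToDeg;        // left/right
    aim.pitchDegrees = std::atan2(d.y, flatLength) * kRadToDeg; // up/down
    // A production version would clamp to the eyes' physical limits and layer
    // in saccades and blinks so the eyes never look dead.
    return aim;
}
```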
The fourth part is a gesture system: people do not stand still while they are talking or listening; they move, change their stance, and gesture as they talk.
The last part is entirely technical: when you trigger audio, you cannot just trigger the facial animation at the same time. You need to offset it to cover the time it takes to get the facial animation onto the screen, since rendering adds several frames between starting the animation and seeing it. The offset depends entirely on your rendering engine and the performance of the hardware. Music games even let the user calibrate this latency, since it is sadly impossible for an engine to know the display latency.
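One simple way to handle that offset, sketched below with invented names: start the facial animation immediately and delay the audio by the estimated time it takes for the first animated frame to reach the screen.

```cpp
// Illustrative audio/visual sync offset: the audio is delayed by the time it
// takes for the first animated frame to reach the screen.
struct AVSyncSettings {
    int   framesInFlight;    // game thread + render thread + GPU + display
    float frameTimeSeconds;  // e.g. 1/30 on a 30 fps title
};

float AudioTriggerDelaySeconds(const AVSyncSettings& settings)
{
    return settings.framesInFlight * settings.frameTimeSeconds;
}

// Usage: play the facial animation now, schedule the audio to start after
// AudioTriggerDelaySeconds(settings) seconds so mouth movement and sound
// line up. Exposing the value as a user calibration is what music games do.
```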
Like every production pipeline, this one has seen optimizations that make capturing data more efficient, and the same thing has happened with facial animation and facial capture. If you can't afford, or don't have time for, full performance capture, you can do facial capture of the voice actors as they record their lines of dialogue.
I believe that for large amounts of content this is a great case for machine learning: you train a model with captured facial animations and the associated audio, then feed it new audio to generate facial animations in the same style. I'm not an expert here, but this is similar to how machine learning is used to solve a lot of other problems.