Face-to-Voice: Face-based personalized multimodal Text-to-Speech synthesis model
Exploring New Possibilities for Text-to-Speech
Overcame a key limitation of traditional Text-to-Speech (TTS) systems by developing a model that generates a personalized voice from facial images without any voice samples, confirming that TTS systems can be extended in new directions.
Enhanced Ability to Model and Improve New Ideas
Proposed a new TTS model, Face-based Voice Synthesis for Text-to-Speech (FVTTS).
Demonstrated performance improvements over existing models, strengthening the ability to turn innovative ideas into concrete models and refine them.
Speech synthesis technology is widely used in fields such as public announcements, smart speakers, audiobooks, and voice task processing, and it plays a crucial role in voice-assisted services and education.
While existing speech synthesis research has developed models that read text in a variety of speakers' voices, there is a growing need for technology that provides users with personalized voices.
Efficient technology is needed to create a new speaker's voice for personalized speech synthesis without additional voice data.
Propose a new multimodal speech synthesis technology that analyzes the correlation between facial images and voices.
Develop personalized, multimodal speech synthesis technology based on facial images.
Build a personalized voice generation system applicable to various environments.
Need to verify whether it is possible to generate diverse voices for different speakers using facial images.
Need to confirm whether it is feasible to generate new speaker voices solely from images without voice samples.
Most previous studies on face-image-based voice generation do not use features extracted directly from the face images for synthesis. Instead, they train an image encoder so that its output features become similar to the voice features produced by a pre-trained voice encoder.
Propose a model capable of synthesizing various speakers' voices based on features directly extracted from facial images.
Face Encoder structure
Extract speaker information from images.
Global Encoder: An encoder for extracting overall information from the image.
Personalized Encoder: An encoder for extracting the speaker's individualized voice information.
Learnable Weight: A trainable weight that combines the features from the two encoders via a weighted sum (a minimal code sketch follows the figures below).
Face-to-voice model structure
Face encoder structure
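As a rough illustration of the Face Encoder described above, the sketch below combines a global branch (Enc_g) and a personalized branch (Enc_p) with a learnable, softmax-normalized weight. The encoder architectures, feature dimensions, and weight parameterization are assumptions for illustration, not the exact FVTTS implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the Face Encoder idea: a global encoder and a
# personalized encoder whose features are mixed by a learnable weight.
# The sub-encoders here are simple stand-in CNNs, not the real FVTTS layers.
class FaceEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Enc_g: captures overall image information.
        self.enc_g = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        # Enc_p: captures speaker-specific, voice-related information.
        self.enc_p = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        # Two learnable scalars, normalized so the mixing weights sum to 1.
        self.mix_logits = nn.Parameter(torch.zeros(2))

    def forward(self, face_img):
        g = self.enc_g(face_img)            # global image features
        p = self.enc_p(face_img)            # personalized voice features
        w = torch.softmax(self.mix_logits, dim=0)
        return w[0] * g + w[1] * p          # speaker embedding for the TTS decoder

# Usage: a batch of 128x128 RGB face crops -> one speaker embedding each.
emb = FaceEncoder()(torch.randn(4, 3, 128, 128))
print(emb.shape)  # torch.Size([4, 256])
```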
LRS3 result
We achieve a SECS score of 0.754, confirming that the same speaker's voice is generated consistently across various text inputs (a sketch of how this metric can be computed follows below).
Evaluated how natural the generated speech sounds and how well the voice matches the face image through MOS (Mean Opinion Score) listening tests.
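The SECS number above is a cosine similarity between speaker embeddings of reference and synthesized speech. The sketch below shows one way to compute such a score using the Resemblyzer speaker encoder; the choice of encoder and the file names are assumptions, so the exact setup behind the reported 0.754 may differ.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

# Embed a reference and a synthesized utterance with a speaker encoder and
# take the cosine similarity of the two embeddings (a SECS-style score).
encoder = VoiceEncoder()

def secs(ref_wav_path, gen_wav_path):
    ref = encoder.embed_utterance(preprocess_wav(ref_wav_path))
    gen = encoder.embed_utterance(preprocess_wav(gen_wav_path))
    return float(np.dot(ref, gen) / (np.linalg.norm(ref) * np.linalg.norm(gen)))

# Example with hypothetical file names:
# print(secs("speaker01_reference.wav", "speaker01_generated.wav"))
```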
Visualization
Compared with existing models, the generated voices are clearly distinguishable by speaker across varied utterances (a sketch of this kind of embedding visualization follows the dataset results below).
The proposed model synthesizes voices that are clearly distinguishable by gender.
We achieve the highest MOS-T in terms of intelligibility, confirming the consistency between text and speech.
We verify the naturalness of the synthesized speech by achieving the highest score in the naturalness evaluation.
On the animation dataset, the model records a preference rate of 98.93% over existing models in preference evaluations.
GRID dataset result
Animation dataset result
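Speaker-separation claims like the ones above are typically supported by projecting speaker embeddings of the synthesized audio into two dimensions. The sketch below shows one such visualization with t-SNE; the embeddings, speaker labels, and plotting details are placeholders, not the actual figures for LRS3, GRID, or the animation dataset.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project per-utterance speaker embeddings to 2-D and color them by speaker.
# `embeddings` (N x D) and `speaker_ids` (N,) stand for embeddings extracted
# from the synthesized audio with any speaker encoder.
def plot_speaker_space(embeddings, speaker_ids):
    points = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        plt.scatter(points[mask, 0], points[mask, 1], s=8, label=str(spk))
    plt.legend(markerscale=2, fontsize=6)
    plt.title("Speaker embeddings of synthesized voices (t-SNE)")
    plt.show()

# Example with random data in place of real embeddings:
plot_speaker_space(np.random.randn(200, 256), np.repeat(np.arange(10), 20))
```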
Face Encoder Visualization
Enc_g emphasizes the contours of the face and learns overall image features.
Enc_p learns personalized features, focusing particularly on the central areas of the face related to pronunciation, such as the nose and lips.
Confirmed that each encoder effectively extracts global image features and voice-related personalized features.
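One simple way to obtain attention maps like those described for Enc_g and Enc_p is a gradient-based saliency map over the input face image. The sketch below is illustrative only; the visualization method actually used may differ, and `encoder` stands for either branch of the Face Encoder sketched earlier.

```python
import torch

# Gradient-based saliency: which pixels of the face most influence the
# encoder's output features.
def saliency_map(encoder, face_img):
    img = face_img.clone().requires_grad_(True)   # (1, 3, H, W)
    feat = encoder(img)
    feat.norm().backward()                        # scalar summary of the features
    # Max absolute gradient over channels -> per-pixel importance.
    return img.grad.abs().max(dim=1)[0].squeeze(0)

# Example with the personalized branch of the earlier FaceEncoder sketch:
# sal = saliency_map(FaceEncoder().enc_p, torch.randn(1, 3, 128, 128))
```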
Learnable weight
During the training process, w_face shows an increasing trend while w_img shows a decreasing trend.
This indicates that the role of Enc_p in learning personalized voice features becomes increasingly important.
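The weight trend above can be tracked by logging the normalized mixing weights at each epoch. The sketch below assumes the softmax-over-two-logits parameterization from the earlier Face Encoder sketch and that w_img corresponds to Enc_g while w_face corresponds to Enc_p; both are assumptions about the naming.

```python
import torch

# Track the two normalized mixing weights over training (training step omitted).
mix_logits = torch.nn.Parameter(torch.zeros(2))   # as in the FaceEncoder sketch

history = []
for epoch in range(100):
    # ... one training epoch that updates mix_logits with the rest of the model ...
    w_img, w_face = torch.softmax(mix_logits.detach(), dim=0).tolist()
    history.append((epoch, w_img, w_face))

# Plotting `history` over epochs would show curves like the reported
# increasing w_face / decreasing w_img trend.
```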
Qualitative Result
The spectrograms of the generated voices maintain differences in pitch and timbre for each speaker.
This suggests that individual features learned from the facial images are reflected in the voices, resulting in unique sound profiles.
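Spectrogram comparisons like the one described can be reproduced with a standard log-mel pipeline. The sketch below uses librosa with hypothetical file names; the sampling rate and mel configuration are assumptions, not the model's actual settings.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Plot log-mel spectrograms of two synthesized utterances side by side to
# compare pitch and timbre across speakers.
def show_mels(paths):
    fig, axes = plt.subplots(1, len(paths), figsize=(5 * len(paths), 3))
    for ax, path in zip(axes, paths):
        wav, sr = librosa.load(path, sr=22050)
        mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80)
        librosa.display.specshow(librosa.power_to_db(mel, ref=1.0),
                                 sr=sr, x_axis="time", y_axis="mel", ax=ax)
        ax.set_title(path)
    plt.tight_layout()
    plt.show()

# show_mels(["speaker01_generated.wav", "speaker02_generated.wav"])
```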
We propose a new TTS model that generates speech directly from facial images.
We introduce a Face Encoder structure that combines global facial features with personalized voice features.
Our model achieves consistent voice synthesis across different images of the same speaker through learned facial features, recording excellent performance in naturalness and text consistency.
We verify the applicability of the facial-based TTS model across various data domains through experiments.
Future research aims to enhance emotional expression and gender recognition for more refined speech synthesis.