SF-CRL: Speech-Facial Contrastive Representation Learning for Speaker Feature Extraction
Experience in Multimodal Representation Learning
Learned the correlation between speech and facial images to capture the speaker's unique voice characteristics.
Explored the possibilities and applications of multimodal representation learning using diverse modality data.
Enhanced Understanding of Contrastive Learning
Gained a deeper understanding of the core concepts and practical applications of contrastive learning by optimizing information sharing and feature extraction between speech and facial images.
Integrating various modalities, such as speech, text, and images, enables richer information representation, which is useful in applications like speaker recognition and biometric authentication.
Existing studies focus primarily on sound in general rather than the human voice, or rely on a single modality, which limits the extraction of speaker-specific features.
Speech characteristics are closely related to the anatomical features of the speaker's face, and leveraging this correlation allows speech characteristics to be extracted more accurately.
Develop a contrastive learning model, SF-CRL (Speech-Facial Contrastive Representation Learning), to effectively capture the speaker's voice characteristics by combining speech and facial images.
Enhance the accuracy of speaker recognition and biometric authentication by complementing voice characteristics with features extracted from facial images.
Overview of SF-CRL
Feature Extraction
Use separate encoders to extract features from speech and facial images.
Mel spectrograms serve as the speech input, and RGB face frames as the image input.
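The two-encoder design could look roughly like the PyTorch sketch below; the layer sizes, the 80-bin mel spectrogram input, and the 512-dimensional shared embedding are illustrative assumptions rather than the exact SF-CRL architecture.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Encodes a mel spectrogram (batch, 1, n_mels, frames) into a fixed-size embedding."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling over time and frequency
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, mel):
        h = self.conv(mel).flatten(1)
        return nn.functional.normalize(self.proj(h), dim=-1)

class FaceEncoder(nn.Module):
    """Encodes an RGB face frame (batch, 3, H, W) into the same embedding space."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, img):
        h = self.conv(img).flatten(1)
        return nn.functional.normalize(self.proj(h), dim=-1)

# Example shapes: 80-bin mel spectrogram with 200 frames, 112x112 face crop.
speech_emb = SpeechEncoder()(torch.randn(4, 1, 80, 200))   # (4, 512)
face_emb   = FaceEncoder()(torch.randn(4, 3, 112, 112))    # (4, 512)
```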
Auxiliary Feature Matching
Utilize Resemblyzer and FaceNet to ensure the quality of speech and image features, guiding accurate feature learning.
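A minimal sketch of how these auxiliary targets could be obtained with the public resemblyzer and facenet-pytorch packages is shown below; the file paths are placeholders, and the exact preprocessing used in SF-CRL is assumed.

```python
import numpy as np
import torch
from PIL import Image
from resemblyzer import VoiceEncoder, preprocess_wav   # speaker embedding (256-dim)
from facenet_pytorch import InceptionResnetV1          # FaceNet face embedding (512-dim)

# Pre-trained reference encoders, used only to supervise the learned features.
voice_ref = VoiceEncoder()
face_ref = InceptionResnetV1(pretrained='vggface2').eval()

# Speaker embedding from a waveform file (path is a placeholder).
wav = preprocess_wav("speaker_utterance.wav")
voice_target = torch.from_numpy(voice_ref.embed_utterance(wav))   # shape (256,)

# Face embedding from a cropped RGB face image (path is a placeholder).
img = Image.open("speaker_face.jpg").resize((160, 160))
face_tensor = torch.from_numpy(np.array(img)).permute(2, 0, 1).float()
face_tensor = (face_tensor / 127.5) - 1.0                          # rough [-1, 1] scaling
with torch.no_grad():
    face_target = face_ref(face_tensor.unsqueeze(0))[0]            # shape (512,)
```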
Contrastive Learning
Learn complementary information from both modalities through contrastive learning between speech and facial images.
Audio-Visual Loss
Apply a cross-modal matching loss to align speech and image features.
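One common form of such a cross-modal matching objective is the symmetric InfoNCE (CLIP-style) loss sketched below; treating it as the exact SF-CRL formulation, and the temperature value, are assumptions.

```python
import torch
import torch.nn.functional as F

def audio_visual_contrastive_loss(speech_emb, face_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matching speech/face pairs share the same batch index."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    face_emb = F.normalize(face_emb, dim=-1)
    logits = speech_emb @ face_emb.t() / temperature         # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2f = F.cross_entropy(logits, targets)              # speech -> face direction
    loss_f2s = F.cross_entropy(logits.t(), targets)          # face -> speech direction
    return 0.5 * (loss_s2f + loss_f2s)
```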
Feature Matching Loss
Evaluate the quality of speech and image features by comparing them with embeddings from pre-trained models for each modality, guiding the learning process.
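A hedged sketch of such a feature matching term is given below, assuming an MSE penalty between the learned embeddings (linearly projected to the target dimensions) and the frozen Resemblyzer/FaceNet embeddings; the projection layers are illustrative.

```python
import torch
import torch.nn.functional as F

def feature_matching_loss(speech_emb, face_emb, voice_target, face_target,
                          speech_proj, face_proj):
    """MSE between projected learned embeddings and frozen pre-trained targets.

    speech_proj / face_proj are small linear layers mapping the model's embedding
    size to the Resemblyzer (256-dim) and FaceNet (512-dim) target dimensions.
    """
    loss_speech = F.mse_loss(speech_proj(speech_emb), voice_target.detach())
    loss_face = F.mse_loss(face_proj(face_emb), face_target.detach())
    return loss_speech + loss_face
```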
LRS3 Dataset: Thousands of spoken English sentences extracted from TED talks.
GRID Dataset: Image and speech data from 33 speakers collected in an experimental environment.
Speaker Similarity: Evaluates how consistently the characteristics of the same speaker are recognized across various utterances.
Mean Average Precision (mAP): Assesses the ability to accurately retrieve speaker images based on voice features.
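These metrics could be computed roughly as in the sketch below: mean pairwise cosine similarity between embeddings of the same speaker, and mean average precision for voice-to-face retrieval. This is an illustrative implementation, not the exact evaluation protocol.

```python
import numpy as np

def speaker_similarity(embeddings, speaker_ids):
    """Mean cosine similarity over all pairs of utterances from the same speaker.

    embeddings: (N, D) numpy array; speaker_ids: (N,) numpy array of labels.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    same = (speaker_ids[:, None] == speaker_ids[None, :]) & ~np.eye(len(e), dtype=bool)
    return sims[same].mean()

def mean_average_precision(voice_emb, face_emb, speaker_ids):
    """mAP for retrieving face embeddings of the correct speaker from a voice query."""
    v = voice_emb / np.linalg.norm(voice_emb, axis=1, keepdims=True)
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    aps = []
    for i, query in enumerate(v):
        order = np.argsort(-(query @ f.T))                  # faces ranked by similarity
        relevant = (speaker_ids[order] == speaker_ids[i])
        if relevant.sum() == 0:
            continue
        precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```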
Baselines
Resemblyzer: An open-source library used for voice analysis tasks. It converts voice samples into high-dimensional vectors to extract voice representations, enabling voice embedding generation, voice similarity comparison, and speaker identification.
Wav2Vec 2.0: A model that extracts voice features using self-supervised learning, pre-trained on large amounts of unlabelled audio; a minimal embedding-extraction sketch follows after this list.
AudioCLIP: Performs tri-modal contrastive learning among images, speech, and text. It was trained on the ESC50 dataset and combines the text and image head weights used in CLIP with the audio head of ESResNet.
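As referenced above, the Wav2Vec 2.0 baseline embeddings might be obtained as in the sketch below, which mean-pools the hidden states of the Hugging Face transformers implementation into one utterance-level vector; the checkpoint name and the pooling choice are assumptions about the evaluation setup.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Checkpoint name is an assumption; any 16 kHz wav2vec 2.0 checkpoint works the same way.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def wav2vec2_utterance_embedding(waveform_16khz):
    """waveform_16khz: 1-D float array of raw audio sampled at 16 kHz."""
    inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0)                        # mean-pool over time
```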
Similarity & mAP
The SF-CRL model outperforms existing models such as Resemblyzer, Wav2Vec 2.0, and AudioCLIP on the LRS3 and GRID datasets.
In particular, it maintains feature consistency across various utterances of the same speaker and performs excellently in speaker recognition and speech-image matching.
Visualization
Embeddings of the same color cluster together, allowing for effective distinction between individual speakers.
Distinct clusters are formed based on gender, enabling high-accuracy recognition of speaker characteristics.
Clusters of different colors are clearly separated, maintaining speaker identity across various utterances.
The encoder emphasizes specific facial areas that are important for identifying the speaker.
These areas include the eyes, nose, and mouth, which have a high correlation with speech.
Ablation Study
Variants trained without the feature matching loss or the MSE loss all perform worse than the full proposed model.
The proposed model architecture effectively extracts voice features using the speaker's image.
Demonstrated the effectiveness of unique voice feature extraction by learning the correlation between speech and images using contrastive learning.
Exhibited excellent generalization performance across various domains, confirming potential applications in speaker recognition and biometric authentication.
The model's versatility needs to be enhanced through training on diverse languages and facial expressions.
Additional research is required to improve robustness to facial images captured from various angles and in varied environments.
Future research will aim to integrate the ability to recognize emotions and context, targeting a more detailed extraction of speaker characteristics.