Publications

Speech Synthesis

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Mathew Baas, Hugo Seuté, Herman Kamper

ICASSP, 2022


The goal of voice conversion is to transform source speech into a target voice while keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and soft speech units as input features. We find that discrete representations effectively remove speaker information but discard some linguistic content, leading to mispronunciations. As a solution, we propose soft speech units. To learn soft units, we predict a distribution over discrete speech units. By modeling uncertainty, soft units capture more content information, improving the intelligibility and naturalness of converted speech.
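The soft-unit idea described above can be sketched in a few lines: instead of snapping each frame to its nearest discrete unit, predict a distribution over the codebook and take the expected embedding. All names, shapes, and weights below are illustrative, not taken from the paper's code:

```python
import numpy as np

def soft_units(features, codebook, proj_w, proj_b):
    # Predict a distribution over the K discrete units for each frame,
    # then take the expected codebook embedding: the "soft" unit.
    logits = features @ proj_w + proj_b              # (T, K) scores per frame
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)     # softmax over units
    return probs @ codebook                          # (T, D) expected embeddings

rng = np.random.default_rng(0)
T, D, K = 5, 8, 16                  # frames, feature dim, codebook size (toy)
feats = rng.normal(size=(T, D))
codebook = rng.normal(size=(K, D))
w, b = rng.normal(size=(D, K)), np.zeros(K)
soft = soft_units(feats, codebook, w, b)
```

Because the output is a convex combination of codebook entries rather than a hard assignment, uncertain frames retain a blend of plausible units instead of committing to one.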


paper

samples


Daft-Exprt: Robust Prosody Transfer Across Speakers for Expressive Speech Synthesis

Julian Zaïdi, Hugo Seuté, Benjamin van Niekerk, Marc-André Carbonneau

Interspeech, 2022


This paper presents Daft-Exprt, a multi-speaker acoustic model advancing the state-of-the-art on inter-speaker and inter-text prosody transfer. This improvement is achieved using FiLM conditioning layers, alongside adversarial training that encourages disentanglement between prosodic information and speaker identity. The acoustic model inherits attractive qualities from FastSpeech 2, such as fast inference and local prosody attribute prediction for finer-grained control over generation. Experimental results show that Daft-Exprt significantly outperforms strong baselines on prosody transfer tasks, while yielding naturalness comparable to state-of-the-art expressive models. Moreover, results indicate that adversarial training effectively discards speaker identity information from the prosody representation, which ensures Daft-Exprt will consistently generate speech with the desired voice. We publicly release our code and provide speech samples from our experiments.
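FiLM (feature-wise linear modulation), the conditioning mechanism mentioned above, is simple to sketch: a conditioning vector (here, a prosody embedding) predicts a per-channel scale and shift applied to the hidden features. The weight names and shapes are illustrative, not from the Daft-Exprt code:

```python
import numpy as np

def film(h, cond, w_gamma, w_beta):
    # The conditioning vector predicts a scale (gamma) and shift (beta)
    # per channel, applied feature-wise to the hidden activations h.
    gamma = cond @ w_gamma    # (C,) per-channel scale
    beta = cond @ w_beta      # (C,) per-channel shift
    return gamma * h + beta   # broadcasts over time: (T, C)

rng = np.random.default_rng(1)
T, C, E = 4, 6, 3             # frames, channels, conditioning dim (toy)
h = rng.normal(size=(T, C))
cond = rng.normal(size=E)
out = film(h, cond, rng.normal(size=(E, C)), rng.normal(size=(E, C)))
```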


paper

samples

code

General Machine Learning

Measuring Disentanglement: A Review of Metrics

Marc-André Carbonneau, Julian Zaidi, Jonathan Boilard, Ghyslain Gagnon

IEEE Transactions on Neural Networks and Learning Systems, 2021


Learning to disentangle and represent factors of variation in data is an important problem in AI. While many advances have been made in learning these representations, it is still unclear how to quantify disentanglement. Several metrics exist, but little is known about their implicit assumptions, what they truly measure, and their limits. As a result, it is difficult to interpret results when comparing different representations. In this work, we survey supervised disentanglement metrics and thoroughly analyze them. We propose a new taxonomy in which all metrics fall into one of three families: intervention-based, predictor-based, and information-based. We conduct extensive experiments in which we isolate representation properties to compare all metrics on many aspects. From experimental results and analysis, we provide insights into the relations between disentangled representation properties. Finally, we provide guidelines on how to measure disentanglement and report the results.


paper

code

Multiple Instance Learning: A Survey of Problem Characteristics and Applications

Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, Ghyslain Gagnon

Published in Pattern Recognition, 2018


Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and makes it possible to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight into how the problem characteristics affect MIL algorithms, recommendations for future benchmarking, and promising avenues for research.
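The bag formulation described above is easy to make concrete under the standard MIL assumption, where a bag is positive if and only if at least one of its instances is positive. A toy sketch (the linear instance scorer and all values are illustrative, not any surveyed method):

```python
import numpy as np

def bag_score(instances, w, b):
    # Score each instance with a sigmoid over a linear model, then
    # max-pool: under the standard MIL assumption, one positive
    # instance is enough to make the whole bag positive.
    scores = 1.0 / (1.0 + np.exp(-(instances @ w + b)))
    return scores.max()

w = np.array([1.0, 0.0])
b = -0.5
positive_bag = np.array([[0.0, 0.0], [5.0, 0.0]])   # one positive instance
negative_bag = np.array([[0.0, 0.0], [-5.0, 0.0]])  # no positive instance
```

Only the bag-level label is ever needed for training; which instance triggered the positive prediction stays latent, which is exactly what makes MIL weakly supervised.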


paper

supplementary results

code

Bag-Level Aggregation for Multiple Instance Active Learning in Instance Classification Problems

Marc-André Carbonneau, Eric Granger, Ghyslain Gagnon

Published in IEEE Transactions on Neural Networks and Learning Systems, 2018


A growing number of applications, e.g. video surveillance and medical image analysis, require training recognition systems from large amounts of weakly annotated data while some targeted interactions with a domain expert are allowed to improve the training process. In such cases, active learning (AL) can reduce labeling costs for training a classifier by querying the expert to provide the labels of the most informative instances. This paper focuses on AL methods for instance classification problems in multiple instance learning (MIL), where data is arranged into sets, called bags, that are weakly labeled. Most AL methods focus on single instance learning problems. These methods are not suitable for MIL problems because they cannot account for the bag structure of data. In this paper, new methods for bag-level aggregation of instance informativeness are proposed for multiple instance active learning (MIAL). The aggregated informativeness method identifies the most informative instances based on classifier uncertainty, and queries bags incorporating the most information. The other proposed method, called cluster-based aggregative sampling, clusters data hierarchically in the instance space. The informativeness of instances is assessed by considering bag labels, inferred instance labels, and the proportion of labels that remain to be discovered in clusters. Both proposed methods significantly outperform reference methods in extensive experiments using benchmark data from several application domains. Results indicate that using an appropriate strategy to address MIAL problems yields a significant reduction in the number of queries needed to achieve the same level of performance as single instance AL methods.
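The core of the aggregated-informativeness idea can be sketched as follows: per-instance uncertainty (highest when a score sits near the decision boundary) is summed to the bag level, and the bag carrying the most information is queried. The linear scorer and uncertainty measure are illustrative simplifications, not the paper's exact formulation:

```python
import numpy as np

def most_informative_bag(bags, w, b):
    # Uncertainty of an instance is 1 - |2p - 1|: it peaks at p = 0.5
    # (the decision boundary) and vanishes for confident predictions.
    # Summing over a bag aggregates informativeness to the bag level.
    def informativeness(bag):
        p = 1.0 / (1.0 + np.exp(-(bag @ w + b)))
        return np.sum(1.0 - np.abs(2.0 * p - 1.0))
    return int(np.argmax([informativeness(bag) for bag in bags]))

w, b = np.array([1.0]), 0.0
bags = [np.array([[5.0], [6.0]]),     # confident instances: little to learn
        np.array([[0.1], [-0.1]])]    # near-boundary instances: query this bag
query = most_informative_bag(bags, w, b)
```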


paper

supplementary results

Robust Multiple-Instance Learning Ensembles Using Random Subspace Instance Selection

Marc-André Carbonneau, Eric Granger, Alexandre Raymond, Ghyslain Gagnon

Published in Pattern Recognition, 2016


Many real-world pattern recognition problems can be modeled using multiple-instance learning (MIL), where instances are grouped into bags, and each bag is assigned a label. State-of-the-art MIL methods provide a high level of performance when strong assumptions are made regarding the underlying data distributions, and the proportion of positive to negative instances in positive bags. In this paper, a new method called Random Subspace Instance Selection (RSIS) is proposed for the robust design of MIL ensembles without any prior assumptions on the data structure and the proportion of instances in bags. First, instance selection probabilities are computed based on training data clustered in random subspaces. A pool of classifiers is then generated using the training subsets created with these selection probabilities. By using RSIS, MIL ensembles are more robust to many data distributions and noise, and are not adversely affected by the proportion of positive instances in positive bags because training instances are repeatedly selected in a probabilistic manner. Moreover, RSIS also allows the identification of positive instances on an individual basis, as required in many practical applications. Results obtained with several real-world and synthetic databases show the robustness of MIL ensembles designed with the proposed RSIS method over a range of witness rates, noisy features and data distributions compared to reference methods in the literature.


paper

Intelligent Signal Processing

Feature learning from spectrograms for assessment of personality traits

Marc-André Carbonneau, Eric Granger, Yazid Attabi, Ghyslain Gagnon

Published in IEEE Transactions on Affective Computing, 2017


Several methods have recently been proposed to analyze speech and automatically infer the personality of the speaker. These methods often rely on prosodic and other hand-crafted speech processing features extracted with off-the-shelf toolboxes. To achieve high accuracy, numerous features are typically extracted using complex and highly parameterized algorithms. In this paper, a new method based on feature learning and spectrogram analysis is proposed to simplify the feature extraction process while maintaining a high level of accuracy. The proposed method learns a dictionary of discriminant features from patches extracted in the spectrogram representations of training speech segments. Each speech segment is then encoded using the dictionary, and the resulting feature set is used to perform classification of personality traits. Experiments indicate that the proposed method achieves state-of-the-art results with a significant reduction in complexity when compared to the most recent reference methods. The number of features and the difficulties linked to the feature extraction process are greatly reduced, as only one type of descriptor is used, for which the 6 parameters can be tuned automatically. In contrast, the simplest reference method uses 4 types of descriptors to which 6 functionals are applied, resulting in over 20 parameters to be tuned.
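The encoding step described above, assigning spectrogram patches to dictionary atoms and pooling the assignments into a fixed-length vector, can be sketched minimally. The nearest-atom assignment and histogram pooling below are one common choice, not necessarily the paper's exact encoder, and all shapes are toy values:

```python
import numpy as np

def encode_spectrogram(spec, dictionary, patch=4):
    # Slide a square window over the spectrogram, assign each patch to
    # its nearest dictionary atom, and return a normalized histogram
    # of atom counts as the segment's fixed-length feature vector.
    F, T = spec.shape
    hist = np.zeros(len(dictionary))
    for f in range(0, F - patch + 1, patch):
        for t in range(0, T - patch + 1, patch):
            p = spec[f:f + patch, t:t + patch].ravel()
            hist[np.argmin(((dictionary - p) ** 2).sum(axis=1))] += 1
    return hist / hist.sum()

rng = np.random.default_rng(2)
spec = rng.random((16, 32))      # toy "spectrogram": 16 bands x 32 frames
atoms = rng.random((10, 16))     # 10 learned atoms of flattened 4x4 patches
code = encode_spectrogram(spec, atoms)
```

The dictionary itself would be learned beforehand (e.g. by clustering training patches); here it is random only to keep the sketch self-contained.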


paper

Energy Disaggregation using Variational Autoencoders

Antoine Langevin, Marc-André Carbonneau, Mohamed Cheriet, Ghyslain Gagnon

Accepted in Elsevier Energy & Buildings


Non-intrusive load monitoring (NILM) is a technique that uses a single sensor to measure the total power consumption of a building. Using an energy disaggregation method, the consumption of individual appliances can be estimated from the aggregate measurement. Recent disaggregation algorithms have significantly improved the performance of NILM systems. However, the generalization capability of these methods to different houses as well as the disaggregation of multi-state appliances are still major challenges. In this paper, we address these issues and propose an energy disaggregation approach based on the variational autoencoder framework. The probabilistic encoder makes this approach an efficient model for encoding information relevant to the reconstruction of the target appliance consumption. In particular, the proposed model accurately generates more complex load profiles, thus improving the power signal reconstruction of multi-state appliances. Moreover, its regularized latent space improves the generalization capabilities of the model across different houses. The proposed model is compared to state-of-the-art NILM approaches on the UK-DALE and REFIT datasets, and yields competitive results. The mean absolute error is reduced by 18% on average across all appliances compared to the state of the art. The F1-score increases by more than 11%, showing improved detection of the target appliance in the aggregate measurement.
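Two VAE ingredients the abstract leans on, the probabilistic encoder and the regularized latent space, reduce to the reparameterization trick and the closed-form KL penalty. A generic sketch of both (standard VAE machinery, not the paper's architecture):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    # Reparameterization trick: z = mu + sigma * eps keeps sampling
    # differentiable with respect to the encoder outputs mu, logvar.
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    # Closed-form KL(q(z|x) || N(0, I)); this is the regularizer that
    # shapes the latent space credited with better cross-house
    # generalization in the abstract.
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

rng = np.random.default_rng(3)
z = reparameterize(np.zeros(4), np.zeros(4), rng)
```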


paper


Detection of alarms and warning signals on a digital in-ear device

Marc-André Carbonneau, Narimène Lezzoum, Jérémie Voix, Ghyslain Gagnon

Published in the International Journal of Industrial Ergonomics, 2013

A majority of workers in industrial environments must wear hearing protection devices. While these hearing protectors provide increased safety in terms of auditory health, in some conditions they also have the adverse effect of preventing individuals from hearing alarm and warning signals, which seriously compromises their safety. Recent advances in the field of microelectronics allow the integration of tiny digital signal processors inside hearing protection devices. This paper develops new algorithms to automatically detect alarm signals in the digitized audio stream fed to the processor. This detection is performed in real time with low latency to quickly inform the user of a dangerous situation. The algorithms were also optimized to require low computational resources due to the limited processing power of typical embedded electronic devices. The proposed algorithms detect periodicity of the signal amplitude in a determined frequency bandwidth. The system was simulated with a database of alarm signals from a major North American manufacturer of industrial alarms and warning signals, mixed with typical environmental noises at signal-to-noise ratios ranging from 0 to 15 dBA. The results show an average true-positive recognition rate of 95% for pulsed alarms compliant with the ISO 7731 standard. The system can be optimized for specific alarms, which results in near-100% true-positive and 0.2% false-positive recognition rates.
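The stated principle, detecting periodicity of the signal amplitude within a frequency band, can be sketched offline with an FFT band-limit, an envelope, and autocorrelation. A real in-ear implementation would use cheap streaming filters instead; the band edges, smoothing window, and pulse-rate range below are all illustrative:

```python
import numpy as np

def pulsed_alarm_strength(x, sr, band=(2000.0, 4000.0),
                          min_rate=1.0, max_rate=10.0):
    # Band-limit the signal, rectify and smooth it into an amplitude
    # envelope, then measure envelope periodicity at plausible alarm
    # pulse rates via normalized autocorrelation.
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
    spec[(freqs < band[0]) | (freqs > band[1])] = 0.0
    env = np.abs(np.fft.irfft(spec, n=len(x)))
    win = int(0.01 * sr)                                  # 10 ms smoothing
    env = np.convolve(env, np.ones(win) / win, mode="same")
    env = env - env.mean()
    ac = np.correlate(env, env, mode="full")[len(x) - 1:]
    lo, hi = int(sr / max_rate), int(sr / min_rate)       # candidate lags
    return ac[lo:hi].max() / (ac[0] + 1e-12)

sr = 8000
t = np.arange(2 * sr) / sr
pulsed = np.sin(2 * np.pi * 3000 * t) * (np.sin(2 * np.pi * 4 * t) > 0)
noise = np.random.default_rng(4).normal(size=t.size)
```

A pulsed tone switching on and off at a steady rate scores near 1, while broadband noise in the same band scores much lower.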

paper

Computer Vision & Graphics

ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech

Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F. Troje, Marc-André Carbonneau

Under Review, 2022


We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. In a series of experiments, we first demonstrate the flexibility and generalizability of our model to new speakers and styles. In a user study, we then show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal. Finally, we release a high-quality dataset of full-body gesture motion including fingers, with speech, spanning 19 different styles.
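The blending of style embeddings mentioned above amounts to taking weighted combinations in the learned latent space. A minimal sketch (the convex-weight normalization and the toy embeddings are illustrative choices, not the paper's exact operation):

```python
import numpy as np

def blend_styles(embeddings, weights):
    # New styles are made by weighted blending of example style
    # embeddings; normalizing the weights keeps the result a convex
    # combination inside the region spanned by the examples.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ np.asarray(embeddings)

happy = np.array([1.0, 0.0, 0.0])   # toy style embeddings
sad = np.array([0.0, 1.0, 0.0])
mix = blend_styles([happy, sad], [0.5, 0.5])
```

Scaling a single embedding (weights that do not sum to one, applied without normalization) would similarly exaggerate or attenuate a style.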


paper

code

Artist guided generation of video game production quality face textures

Christian Murphy, Sudhir Mudur, Daniel Holden, Marc-André Carbonneau, Donya Ghafourzadeh, André Beauchamp

Published in Computers & Graphics, 2021


We develop a high resolution face texture generation system which uses artist provided appearance controls as the conditions for a generative network. Artists are able to control various elements in the generated textures, such as the skin, eye, lip, and hair color. This is made possible by reparameterizing our dataset to the same UV mapping, allowing us to utilize image-to-image translation networks. Although our dataset is limited in size, only 126 samples in total, our system is still able to generate realistic face textures which strongly adhere to the input appearance attribute conditions because of our training augmentation methods. Once our system has generated the face texture, it is ready to be used in a modern game production environment. Thanks to our novel super-resolution and material property recovery methods, our generated face textures are 4K resolution and have the associated material property maps required for ray-traced rendering.


paper

Real-time visual play-break detection in sport events using a context descriptor

Marc-André Carbonneau, Alexandre J Raymond, Eric Granger, Ghyslain Gagnon

Published in IEEE International Symposium on Circuits and Systems (ISCAS), 2015


This paper presents a two-stage hierarchical method for play-break detection in non-edited team sports video feeds. Unlike most existing methods, this algorithm uses modern action and event recognition methods and thus does not rely on production cues from broadcast feeds, instead concentrating on the content of the video. Moreover, the method does not require player tracking, can be used in real time, and can easily be adapted to different sports. In the first stage, bag-of-words event detectors are trained to recognize key events such as line changes, face-offs and preliminary play-breaks. In the second stage, the output of the detectors, along with a novel feature based on the number of detected spatio-temporal interest points, is used to create a context descriptor. The final classification is performed on this context descriptor. Experiments demonstrate the benefits of using this context descriptor, reducing the frame classification error by 18% when compared to the baseline method. The efficiency of the proposed method is demonstrated on a real hockey game (accuracy over 88%).
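The context descriptor described above, stacking detector outputs and interest-point counts over a temporal window, can be sketched as a simple windowed concatenation. The window size, number of detectors, and layout are illustrative assumptions, not the paper's exact descriptor:

```python
import numpy as np

def context_descriptor(detector_probs, stip_counts, frame, window=5):
    # Stack the per-frame event-detector outputs and the number of
    # detected spatio-temporal interest points over a temporal window
    # centered on the frame, yielding one fixed-length context vector
    # for the final play-break classifier.
    lo, hi = frame - window, frame + window + 1
    block = np.hstack([detector_probs[lo:hi], stip_counts[lo:hi, None]])
    return block.ravel()

probs = np.random.default_rng(5).random((100, 3))  # 3 toy event detectors
counts = np.arange(100.0)                          # toy interest-point counts
desc = context_descriptor(probs, counts, frame=50, window=2)
```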


paper

code

data