The Role of Vocal Persona in Natural and Synthesized Speech (Conference Paper, 2022)
A Design Framework for Expressive Voice Synthesis (Late-Breaking Demo Abstract, 2021)
Poster presented at the Late-Breaking Demo Session of the 22nd International Society for Music Information Retrieval Conference, November 2021.
This design proposal is the first step in a larger project. We propose a design framework for interactive, real-time control of expression within a synthesized voice. Within the framework, we introduce two concepts that would enable a user to flexibly control their personalized sound. First, the voice persona that determines the “tone of voice” is defined as a point in a continuous probability space; this point sets the parameters of the distribution over the latent features required for synthesis, allowing for flexible modification and fine-tuning. Second, expression within a persona can be achieved by manipulating meaningful high-level abstractions, which we call macros, that subsequently modify the distribution of the corresponding latent features of the synthetic speech signal.
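As a rough illustration of these two concepts, the sketch below models a persona as a point that parameterizes a Gaussian distribution over latent synthesis features, and a macro as a shift of that distribution along a meaningful direction. The names (`Persona`, `apply_macro`) and the “breathiness” axis are hypothetical, chosen for illustration, and are not the framework's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class Persona:
    """Hypothetical persona: a point in a continuous space that
    parameterizes a Gaussian distribution over latent synthesis features."""

    def __init__(self, mean: np.ndarray, log_std: np.ndarray):
        self.mean = mean        # center of the latent distribution
        self.log_std = log_std  # spread (log scale, for numerical stability)

    def sample_latents(self, n: int) -> np.ndarray:
        """Draw latent feature vectors that realize this persona."""
        std = np.exp(self.log_std)
        return self.mean + std * rng.standard_normal((n, self.mean.size))

def apply_macro(persona: Persona, direction: np.ndarray, amount: float) -> Persona:
    """Hypothetical macro: a high-level control that shifts the latent
    distribution along a meaningful direction (e.g., 'breathiness')."""
    return Persona(persona.mean + amount * direction, persona.log_std)

# A 16-dimensional latent space with a neutral starting persona.
dim = 16
neutral = Persona(mean=np.zeros(dim), log_std=np.full(dim, -1.0))

# Nudge the persona along an (illustrative) "breathiness" axis in real time.
breathiness_axis = rng.standard_normal(dim)
breathy = apply_macro(neutral, breathiness_axis, amount=0.5)

latents = breathy.sample_latents(4)  # these latents would feed the synthesizer
print(latents.shape)                 # (4, 16)
```

Because a macro only re-parameterizes the distribution rather than fixing a single latent vector, the persona retains its variability while the user steers its overall character.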
In music and speech, meaning is derived at multiple levels of context. Affect, for example, can be inferred both from a short sound token and from sonic patterns over a longer temporal window, such as an entire recording. This project explores what kind of semantic meaning one can infer from learning this dichotomy of contexts. We show how contextual representations of short sung vocal lines can be implicitly learned from fundamental frequency (F0) and then used as a meaningful feature space for downstream Music Information Retrieval (MIR) tasks. We propose three self-supervised deep learning paradigms that leverage pseudo-task learning of these two levels of context to produce latent representation spaces. We evaluate the usefulness of these representations by embedding unseen pitch contours into each space and conducting downstream classification tasks. Our results show that contextual representation can improve downstream classification by as much as 15% compared to traditional statistical contour features.
A "slot-filling" context-learning scheme implemented by a neural network. Idea: Learn adjacent context in order to fill in the gap in the middle of a short sung phrase.
Short Paper: ICML Machine Learning for Musical Discovery Workshop (2019)
(Right) 2D similarity plot of singers' reported countries, computed from audio features of their solo singing performances. Ten classes with equal representation are shown.
This project studied characteristics of regional language accent in solo singing, using convolutional neural networks to classify the country and language of singers performing a karaoke-style rendition of the popular hymn “Amazing Grace.” The models appeared to learn to separate predicted countries along a rhythmic-stress dimension, with English variants at the origin of that dimension. Given the network’s success in learning intonation features, these observations suggest that a singer’s speech pronunciation adapts to the language of the song being sung.
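For concreteness, here is a minimal sketch of a CNN classifier of the kind described, mapping a spectrogram-like input to logits over ten country classes. The layer sizes, input shape, and the log-mel input assumption are illustrative, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class CountryCNN(nn.Module):
    """Illustrative CNN for classifying a singer's country from audio
    features; shapes and layer sizes are assumptions for this sketch."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size summary of the clip
        )
        self.classifier = nn.Linear(32 * 4 * 4, n_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, mel_bins, frames), e.g., a log-mel spectrogram
        x = self.features(spec).flatten(1)
        return self.classifier(x)  # logits over the 10 country classes

model = CountryCNN()
logits = model(torch.randn(4, 1, 64, 128))  # toy batch of spectrograms
print(logits.shape)                         # torch.Size([4, 10])
```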
To evaluate a single performance, coders rated five aspects of a singer’s voice while listening to the recording: estimated age, gender, skill level (ranging from unskilled to highly skilled), likeability (ranging from not likeable at all to very likeable), and expressiveness (ranging from not expressive at all to very expressive).
On average, listeners gave performers a skill score of 3.35 (where 1 is “unskilled” and 7 is “highly skilled”), a likeability score of 3.63, and an expressiveness score of 3.36. Interestingly, listeners seemed to find vocal performances more likeable than skilled or expressive, on average. The ratings are skewed toward smaller values, as coders rarely gave a rating of 7: performers were deemed “highly skilled” in only 18 of the 1,600 evaluations, “very likeable” in 23, and “very expressive” in 22. Notably, the perceived skill of a performer and the expressiveness of their performance have a correlation of 0.83, while the correlation between perceived skill and likeability was 0.73, and between likeability and expressiveness 0.72.
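A short sketch of how such summary statistics could be computed: with the three ratings arranged as arrays, the per-dimension means and pairwise Pearson correlations come straight from NumPy. The data below is synthetic, standing in for the actual 1,600 coder evaluations, so the printed numbers will not match those reported above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the 1,600 evaluations: three 1-7 ratings per
# evaluation (skill, likeability, expressiveness). Not the real data.
n = 1600
skill = rng.integers(1, 8, n).astype(float)
like = np.clip(skill + rng.normal(0, 1.2, n), 1, 7)
expr = np.clip(skill + rng.normal(0, 1.0, n), 1, 7)

print("means:", skill.mean(), like.mean(), expr.mean())

# Pairwise Pearson correlations between the three rating dimensions,
# analogous to the r = 0.83 / 0.73 / 0.72 figures reported above.
r = np.corrcoef(np.vstack([skill, like, expr]))
print("skill-expr:", r[0, 2], "skill-like:", r[0, 1], "like-expr:", r[1, 2])
```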