1. Introduction
Humans can manipulate the complex interaction between their lungs, vocal tract, and vocal folds in many different ways.
As a result, the singing voice can be astoundingly expressive, and singers can ornament their performances, for example with glissando (a gradual slide in pitch from one note to the next) or vocal fry (a "raspy"-sounding distortion occurring particularly at low pitches, and also at the start and end of words during speech).
Here we focus on phonation mode as one of these expressive elements.
According to Sundberg [1], there are four different phonation modes: "breathy", "neutral", "flow" and "pressed". They arise from combinations of subglottal pressure (the pressure below the vocal folds) and glottal airflow (how much air passes the vocal folds and eventually exits through mouth and nose), as seen in the figure to the right. With low glottal airflow, neutral phonation arises at low subglottal pressure and pressed phonation at high subglottal pressure.
Breathy phonation is a result of high glottal airflow and low subglottal pressure, while flow phonation features higher subglottal pressure in comparison.
Phonation mode can be considered as an additional expressive dimension along with pitch and loudness, giving singers more freedom to express certain emotions and also develop their own style.
Phonation mode also has applications in speech emotion detection, as it is one of the features that could help determine the emotional state of a speaker.
The physiological parameters involved in producing these phonation modes have been reasonably well studied: one study found that laryngeal resistance, the ratio of subglottal pressure to average glottal airflow, differs significantly between phonation modes [2], as table 1 to the right shows. Unfortunately, these physiological parameters are usually measured directly from the body and are difficult to estimate when only an audio recording of the voice is available. In this project, I aim to construct a classifier that distinguishes the phonation modes automatically, based only on audio recordings, by extracting suitable audio features.
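Spelled out as a formula (notation mine, inferred from the description above and the title of [2], so treat it as a sketch rather than the study's exact definition):

```latex
% Laryngeal resistance: ratio of mean subglottal pressure to mean glottal airflow
R_L = \frac{P_{\mathrm{sub}}}{\bar{U}}
```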
2. Dataset and related work
Because phonation mode is a rather specific property of the singing voice, only one publicly available dataset [3] with phonation mode annotations could be found, indicating a critical lack of data for further research in this area.
The dataset includes approximately 900 recordings in total, each containing a single vowel sung by a professional female singer in one of the four phonation modes breathy, neutral, flow and pressed. The recordings are trimmed so that only minimal silence remains before and after the sung part. Recordings vary in pitch between A3 and G5 and cover the nine different vowels A, AE, I, O, U, OE, UE, Y and E.
However, not all phonation modes could be produced by the singer in the higher pitch range, leaving only neutral and breathy above B4. To obtain a balanced dataset, where phonation modes are featured equally for every combination of pitch and vowel, only recordings with pitches between A3 and B4 are retained. Additionally, I exclude all recordings that are only alternative variants of recordings with the same vowel, pitch and phonation mode.
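A minimal sketch of this filtering step, assuming per-recording metadata is available as a list of dictionaries (the field names and the `recordings` list are my own illustrative assumptions, not the dataset's actual annotation format):

```python
import librosa

# Pitch range retained for the balanced subset, as MIDI note numbers (A3 = 57, B4 = 71)
LOW, HIGH = librosa.note_to_midi("A3"), librosa.note_to_midi("B4")

def keep(recording):
    """recording: hypothetical dict with 'pitch' (e.g. 'C4'), 'vowel', 'mode', 'is_variant'."""
    in_range = LOW <= librosa.note_to_midi(recording["pitch"]) <= HIGH
    return in_range and not recording["is_variant"]   # drop alternative takes

recordings = []            # hypothetical metadata list, one dict per recording
balanced = [r for r in recordings if keep(r)]
```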
A first attempt at automatic classification on this dataset was made by the same authors that published it [4], reaching accuracies between 60% and 75% when trained on each vowel individually. Their method uses inverse filtering to estimate the spectrum of the glottal source (describing the vibrations of the vocal folds) and the formant frequencies. This is a physiologically inspired approach: the spectrum of the glottal source should allow pressedness and breathiness to be estimated from how the vocal folds open and close, thus extracting the required information directly at its source. They also presume that, for this reason, simpler spectral descriptors are unlikely to work, mentioning Mel-frequency cepstral coefficients in particular. These are introduced in the next section, as I will use them to build a working classifier and thereby disprove this assumption.
3. Feature extraction
For this project, Mel-frequency cepstral coefficients (MFCCs) are used, because they are widely and successfully used in many areas of music information retrieval as a general descriptor of timbre.
The main reason for this choice is the idea that if humans can perceive differences in phonation mode just by listening, there should be a spectral feature describing these differences that does not rely on estimating the glottal source waveform, and timbre is a likely carrier of these perceived differences.
MFCCs are computed on a frame-by-frame basis, meaning the audio is divided into short frames in time and the spectrum of each frame is calculated using the Fourier transform.
Then the spectral powers are summarised into N bands spaced according to the perceptually motivated Mel scale, and the band energies are log-transformed.
Arguably the most critical step is the subsequent discrete cosine transform (DCT) of the resulting log-Mel spectrum, which yields coefficients describing the shape of the spectral envelope: the Mel-frequency cepstral coefficients.
Because the lower coefficients capture most of the energy and therefore most of the information in the spectral envelope, typically only the first 13 MFCCs are used in most applications. Here, however, I analyse the first 40 coefficients to identify those important for phonation mode detection. In addition to the standard setting of N=40 bands, which I will call MFCC40B, I extract a variant called MFCC80B using N=80 bands in the hope of attaining better classification results. The reasoning is that the higher resolution could be beneficial particularly for describing the lower frequencies of the spectral envelope, which are presumed to be especially important for discriminating phonation modes. Both features contain the 0-th coefficient, which corresponds to the overall energy in the signal.
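The computation steps above can be summarised in a short sketch using librosa and scipy building blocks (the parameter values here are placeholders, not necessarily those of the original implementation):

```python
import numpy as np
import librosa
import scipy.fftpack

def mfcc_by_hand(y, sr, n_mels=40, n_mfcc=40, win=2048, hop=1024):
    # frame-wise power spectrum via the short-time Fourier transform
    power_spec = np.abs(librosa.stft(y, n_fft=win, hop_length=hop)) ** 2
    # summarise the spectrum into n_mels bands on the Mel scale, then log-transform
    mel_fb = librosa.filters.mel(sr=sr, n_fft=win, n_mels=n_mels)
    log_mel = np.log(mel_fb @ power_spec + 1e-10)
    # DCT of the log-Mel spectrum yields the cepstral coefficients
    return scipy.fftpack.dct(log_mel, axis=0, norm="ortho")[:n_mfcc]
```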
The MFCCs are calculated for each frame using a window length of 50 ms with 50% overlap, and then averaged over all frames, yielding a single one-dimensional MFCC vector for each recording.
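As a concrete extraction sketch, assuming librosa (the original implementation is not specified in the text, so this approximates the described procedure rather than reproducing the exact code used):

```python
import librosa

def mean_mfcc(path, n_mels=40, n_mfcc=40):
    y, sr = librosa.load(path, sr=None)
    win = int(0.050 * sr)            # 50 ms analysis window
    hop = win // 2                   # 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_mels=n_mels,
                                n_fft=win, win_length=win, hop_length=hop)
    return mfcc.mean(axis=1)         # average over frames: one vector per recording

# MFCC40B and MFCC80B for one (hypothetical) recording file
mfcc40b = mean_mfcc("recording.wav", n_mels=40)
mfcc80b = mean_mfcc("recording.wav", n_mels=80)
```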
4. Feature analysis
While MFCCs work well for many tasks in music information retrieval, they are difficult to interpret intuitively, often leaving the researcher wondering why the constructed system is successful. To alleviate this problem, we visualise MFCC40B and look for patterns in the coefficients by visual inspection. The goal is to find out whether MFCCs could be a suitable feature for classifying phonation modes. A good feature for classification should have two properties: it should be relevant, meaning it separates the defined classes in feature space, and it should be robust, meaning its value does not change when attributes of the input other than the class change.
Firstly, each MFCC is normalised across the whole dataset to have zero mean and unit standard deviation, unifying the range of values so that one colormap can capture the fluctuations of every MFCC, and making it easier to compare the values of a single MFCC when calculated for different recordings.
The resulting normalised MFCC40B features are visualised in the figure to the right. For each recording, its 40 coefficients are plotted as a column along the vertical axis, and these columns are arranged along the horizontal axis. Note that the recordings are sorted by phonation mode, and within each phonation mode in ascending order of pitch. Feature values are indicated by colour.
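A sketch of the normalisation and the plot follows; the feature matrix and the per-recording mode and pitch labels are assumed to be available, and random placeholder data is used here only so that the snippet runs on its own:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(240, 40))     # placeholder for the (recordings x 40) MFCC40B matrix
modes = rng.integers(0, 4, size=240)      # placeholder phonation-mode labels (0-3)
pitches = rng.integers(57, 72, size=240)  # placeholder MIDI pitches (A3-B4)

# normalise each coefficient across the dataset to zero mean and unit standard deviation
normalised = (features - features.mean(axis=0)) / features.std(axis=0)

# sort recordings by phonation mode, then by ascending pitch within each mode
order = np.lexsort((pitches, modes))

plt.imshow(normalised[order].T, aspect="auto", origin="lower")
plt.xlabel("Recording (grouped by phonation mode, ascending pitch within each group)")
plt.ylabel("MFCC coefficient")
plt.colorbar(label="Normalised value")
plt.show()
```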
This visualisation leads to a number of interesting findings. The most visually striking is a set of parallel diagonal lines running from the lower left to the upper right. These occur for every phonation mode in coefficients 9 and upwards and imply that their values depend on the pitch of the recording (inferred from the sorting by pitch). Because of this lack of robustness to pitch changes, these coefficients could be suboptimal for a subsequent classification task. Another observation is that the lower coefficients tend to differ in their averages across phonation modes. The 0-th coefficient, indicating the average energy in the recording, appears especially uniform within and clearly different between phonation modes, meaning it could be a good feature for classification. More specifically, loudness tends to increase when going from breathy to neutral to pressed to flow, which is confirmed by intuition and by listening to the recordings.
Breathy phonation in particular stands out, with a low 0-th coefficient and high 1st, 2nd and 4th coefficients compared to the other phonation modes.
5. Classification
As classifier, I use feed-forward neural networks (NNs) due to their robustness against noise. The NN has one hidden layer containing N sigmoid neurons, where N is a hyperparameter optimised in this project. The output layer contains four neurons with a softmax activation, from which the output class is derived.
After randomly initialising the weights from a Gaussian distribution, training is performed by minimising the cross-entropy error on the training set using backpropagation with stochastic gradient descent, until the accuracy on the validation set has failed to increase six times in a row. This early stopping prevents overfitting the network to the training data.
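A rough scikit-learn equivalent of this setup (not the original implementation; note that MLPClassifier carves its own validation split from the training data rather than accepting the explicit validation fold described here, and uses its own weight initialisation scheme):

```python
from sklearn.neural_network import MLPClassifier

def build_network(n_hidden):
    return MLPClassifier(hidden_layer_sizes=(n_hidden,),
                         activation="logistic",     # sigmoid hidden units
                         solver="sgd",              # stochastic gradient descent
                         early_stopping=True,       # stop on stagnating validation score
                         n_iter_no_change=6,        # ...after 6 epochs without improvement
                         validation_fraction=0.1,
                         max_iter=2000)
```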
Network performance is evaluated with cross-validation: the dataset is randomly split into ten subsets, of which eight are used for training and one each for validation and testing. I use the F-measure, the harmonic mean of precision and recall, to assess performance. The F-measure is calculated after training the network for every combination of test and validation set (resulting in 10 * 9 = 90 combinations), and the mean of these individual F-measures serves as the overall performance indicator.
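A sketch of this evaluation scheme, reusing `build_network` from above; `X` is assumed to be the matrix of mean MFCC vectors and `y` the phonation-mode labels, and the macro-averaged F-measure is used here since the exact averaging in the original is not specified:

```python
import numpy as np
from itertools import permutations
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score

def cross_validate(X, y, n_hidden, n_splits=10, seed=0):
    folds = [test for _, test in KFold(n_splits, shuffle=True, random_state=seed).split(X)]
    scores = []
    # every ordered pair of distinct folds as (validation, test): 10 * 9 = 90 runs
    for val_i, test_i in permutations(range(n_splits), 2):
        train = np.hstack([folds[i] for i in range(n_splits) if i not in (val_i, test_i)])
        # the explicit validation fold is left unused here, since MLPClassifier
        # performs its own internal validation split for early stopping
        clf = build_network(n_hidden).fit(X[train], y[train])
        scores.append(f1_score(y[folds[test_i]], clf.predict(X[folds[test_i]]), average="macro"))
    return float(np.mean(scores))
```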
To investigate the influence of the number of included MFCCs D and the number of neurons N on performance, a grid search is executed, calculating the performance for every combination of {1, ..., 40} MFCCs with {1, ..., 40} neurons. This process is carried out for both MFCC40B and the MFCC80B variant with 80 bands.
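The grid search then amounts to a double loop over the two hyperparameters (sketch, reusing `cross_validate` from above and the assumed `X`, `y`):

```python
import numpy as np

# mean F-measure for every (number of MFCCs, number of neurons) combination
results = np.zeros((40, 40))
for d in range(1, 41):                     # use only the first d coefficients
    for n in range(1, 41):                 # number of hidden neurons
        results[d - 1, n - 1] = cross_validate(X[:, :d], y, n_hidden=n)
```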
When using the standard MFCC40B features, the resulting mean F-measures are displayed on the right.
For fewer than about five neurons, performance is significantly degraded, likely because the network lacks the capacity required for successful classification.
Performance starts low with a small number of MFCCs and increases gradually as more MFCCs are included, with the increase being most pronounced for the lower coefficients. This could be a result of the discrete cosine transform employed as the last step of the MFCC calculation, which concentrates most of the information/energy of the spectrum in the first coefficients, effectively compressing the spectral information.
Excluding the positive statistical outliers that some configurations show due to the random variation induced by cross-validation and network initialisation, a good overall performance of about 0.74 is achieved with only about 10 neurons and 18 MFCCs, making the resulting model compact and possibly better at generalising.
For comparison, the corresponding figure for the MFCC80B feature is depicted to the right.
Again, we see low performance with fewer than about four neurons, and performance that increases as more MFCCs are added, particularly the lower coefficients.
However, performance is significantly higher than with MFCC40B, reaching an F-measure of 0.7965 with only 8 neurons and 15 MFCCs, indicating that the increased number of Mel bands used to compute the MFCC representation aids classification. One explanation could be the increased resolution in the lower coefficients, which appear to be most relevant for phonation mode classification.
6. Conclusion
In this project, a dataset containing vowels sung in different phonation modes was used to extract timbral information in the form of MFCCs. In a feature analysis step, the MFCCs were then examined with regard to their dependence on phonation mode and pitch. Visual inspection suggests that the lower MFCCs are of greater importance for phonation mode detection, while the higher coefficients appear to be pitch-dependent and therefore potentially less suitable as features. It also indicated that the 0-th MFCC, which here represents the average energy (an approximation of loudness) in the signal, differs between phonation modes.
Two different variants of MFCCs are investigated: MFCC40B uses 40 Mel bands to describe the energies present in the spectrum, which can be considered the standard setting used in many applications, while MFCC80B uses 80 Mel bands, which is introduced with the intention of increasing the resolution in the lower coefficients for better classification.
Feed-forward neural networks with one hidden layer are trained on the dataset to classify the phonation mode using these MFCC variants, reaching good accuracies overall and disproving the claim in previous literature [4] that MFCCs are not well suited for this task. Interestingly, performance with 80 Mel bands (MFCC80B) is generally higher than with 40 Mel bands (MFCC40B), underlining the importance of the lower MFCCs and suggesting that the number of Mel bands is a parameter worth optimising in future work on phonation mode classification. The best model reaches an F-measure of 0.7965 with only 8 hidden neurons and 15 MFCCs, a performance comparable with a recent approach achieving an F-measure of 0.84 [5].
7. References
[1] Johan Sundberg. The Science of the Singing Voice. Northern Illinois University Press, 1987.
[2] Elizabeth U. Grillo and Katherine Verdolini. Evidence for distinguishing pressed, normal, resonant, and breathy voice qualities by laryngeal resistance and vocal efficiency in vocally trained subjects. Journal of Voice, 22(5):546–552, 2008.
[3] Polina Proutskova, Christophe Rhodes, Geraint A. Wiggins, and Tim Crawford. Breathy or resonant - A controlled and curated dataset for phonation mode detection in singing. In Proceedings of the 13th International Society for Music Information Retrieval Conference (ISMIR), pages 589–594, 2012.
[4] Polina Proutskova, Christophe Rhodes, Tim Crawford, and Geraint Wiggins. Breathy, resonant, pressed - automatic detection of phonation mode from audio recordings of singing. Journal of New Music Research, 42(2):171–186, 2013.
[5] Leonidas Ioannidis, Jean-Luc Rouas, and Myriam Desainte-Catherine. Caractérisation et classification automatique des modes phonatoires en voix chantée. In XXXèmes Journées d'Études sur la Parole, 2014.