It is easy to increase the volume of a vocal sample, but to really change the perceived dynamic level of speech, many other factors have to be taken into account. This project explores the differences between the voice at different dynamic levels, and attempts to convert recorded samples between them.
To analyse the differences in the voice at different dynamic levels (and to give me samples to try manipulating), I recorded myself saying 'This is a recording of my voice' at different dynamic levels, labelled 'recording_low', 'recording_medium', and 'recording_high'. I also recorded a similar set of samples of myself saying 'Mind the gap'. The first phrase is a reasonably neutral declarative; the second is imperative.
Analysis
I decided to focus on the spectral aspects of what makes loud speech sound louder, as opposed to the timing aspects, so I time warped each of the samples to have the same timing. Analysis of samples was done using Praat and Matlab.
F0 (Hz)
I analysed the mean, minimum and maximum F0 of each sample using Praat, and plotted the F0 over time from Praat using Matlab, so that I could compare the different samples more easily.
As I spoke louder, the average F0 of the sample and overall F0 variation increased significantly. As can be seen in the graph of F0 values above, this effect was significantly more pronounced towards the beginning of the sample. I believe this is due to the natural drop in pitch at the end of a phrase when making a statement, as the line I recorded was fairly neutral.
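The analysis itself was done with Praat's pitch tracker, but the basic idea can be illustrated with a minimal autocorrelation F0 estimator. This is a rough Python/numpy sketch, not Praat's actual algorithm (which adds windowing, interpolation, and voicing decisions on top):

```python
import numpy as np

def estimate_f0(frame, sr, fmin=75.0, fmax=500.0):
    """Estimate the F0 of one frame by finding the strongest
    autocorrelation peak in the plausible pitch range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sr / fmax)  # shortest plausible period (samples)
    hi = int(sr / fmin)  # longest plausible period (samples)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# Sanity check on a synthetic 220 Hz tone (40 ms frame).
sr = 16000
t = np.arange(0, 0.04, 1 / sr)
f0 = estimate_f0(np.sin(2 * np.pi * 220 * t), sr)
```

Running the estimator over successive frames of a recording gives the kind of F0-over-time track plotted above.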
A similar graph of me saying 'Mind the gap' shows that there is less of a difference between the beginning and the end of the phrase for imperative phrases. (Note: the 'Mind the gap' samples weren't warped to the same timing, so the lines on the graph are slightly out of sync.)
Intensity (dB)
Analysed in the same way as F0. I also took a second minimum that ignores the start of each sample, because Praat calculated very low values for the initial intensity of the sample, before my speech had even started (this can be seen in the graph below).
Recordings were made using a phone, and I didn't have a reference sound to help calculate the true amplitude of the signal, so the intensity values likely aren't completely accurate.
Going from a low mumble to louder speech increases the intensity. The medium and high voices are very close to each other in intensity, and all three voices had very similar intensity curves over the course of the phrase.
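Since the recordings were uncalibrated, intensity here just means frame-wise RMS level relative to an arbitrary reference. A minimal Python/numpy sketch of that computation (not Praat's exact method, which also applies a window):

```python
import numpy as np

def intensity_db(signal, sr, frame_ms=32, ref=1.0):
    """Frame-wise RMS intensity in dB relative to `ref` (arbitrary,
    since the recordings had no calibration reference)."""
    n = int(sr * frame_ms / 1000)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n, n)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    return 20 * np.log10(np.maximum(rms, 1e-10) / ref)

# A full-scale 500 Hz sine should sit near -3.01 dB in every frame.
sr = 16000
t = np.arange(sr) / sr
db = intensity_db(np.sin(2 * np.pi * 500 * t), sr)
```

The very low dB values Praat reported before speech onset are exactly what this produces for near-silent frames, which is why the second minimum was taken.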
Formant
It was hard to compare numbers for the formant values, as Praat's formant detection tends to jump to the next formant if it can't find a value for a lower formant. To start the comparison, I first put all the formant values together in a graph without any connecting lines.
Looking at the formants graphed in Matlab, the formant frequencies are very similar across the recordings.
There is a difference in the high recording's formants at around ~0.4s, due to the difference in pronunciation of the word 'a', and there are noise areas on the graph where Praat was trying to find the formant of unvoiced sounds. The best example of this is the noise a little after the 1.4s mark, which was calculated from the 's' sound at the end of 'voice'.
Extracting the voiced and unvoiced segments of the samples using Praat Vocal Toolkit [1] allowed me to get a much better look at the formant frequencies for the voiced sounds. Manually fixing the data so that most of the formant jumps were removed also allowed me to calculate the means of each formant and better visualise their changes over time.
Judging by these values, louder speech increases the frequency of the first formant and doesn't have much effect on the others.
Spectrum
I used Matlab to produce averaged frequency spectra of each of the samples. The louder voices have more power across the spectrum as a whole, but the most significant increase in power seems to be around 200-300 Hz.
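The averaging was done in Matlab, but the same Welch-style approach (average the magnitude spectra of overlapping windowed frames) can be sketched in Python/numpy. This is an illustration of the technique, not the author's exact script:

```python
import numpy as np

def averaged_spectrum(signal, sr, n_fft=1024):
    """Average the magnitude spectra of Hann-windowed, half-
    overlapping frames (Welch-style)."""
    win = np.hanning(n_fft)
    hop = n_fft // 2
    frames = [signal[i:i + n_fft] * win
              for i in range(0, len(signal) - n_fft, hop)]
    mags = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(n_fft, 1 / sr)
    return freqs, mags.mean(axis=0)

# A 300 Hz tone should peak near 300 Hz in the averaged spectrum.
sr = 16000
t = np.arange(sr) / sr
freqs, spec = averaged_spectrum(np.sin(2 * np.pi * 300 * t), sr)
```

Averaging over frames smooths out the phrase's moment-to-moment variation, which is what makes the broad 200-300 Hz difference between dynamic levels visible.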
Voice Transformation
To test the viability of voice transformation between different dynamic levels, I used the Praat Vocal Toolkit to copy features between samples. From the above analysis, I concluded that the most important features were pitch, intensity, and the shape of the frequency spectrum. I tested conversion between the low and high samples in both directions to evaluate which way works better, and what problems remain in the results.
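Of the three copied features, intensity transfer is the simplest to illustrate: rescale each frame of the target so its RMS matches the corresponding frame of the source. A crude Python/numpy stand-in for the toolkit's behaviour (the toolkit itself works differently, via Praat's intensity tiers):

```python
import numpy as np

def copy_intensity(source, target, sr, frame_ms=32):
    """Scale each frame of `target` so its RMS matches the
    corresponding frame of `source` (samples assumed time-aligned)."""
    n = int(sr * frame_ms / 1000)
    out = target.astype(float).copy()
    for i in range(0, min(len(source), len(target)) - n, n):
        rs = np.sqrt(np.mean(source[i:i + n] ** 2))
        rt = np.sqrt(np.mean(target[i:i + n] ** 2))
        if rt > 1e-10:  # leave silent frames alone
            out[i:i + n] *= rs / rt
    return out

# Toy check: a quiet tone inherits the loud tone's level.
sr = 16000
t = np.arange(sr) / sr
loud = 0.5 * np.sin(2 * np.pi * 500 * t)
quiet = 0.1 * np.sin(2 * np.pi * 500 * t)
out = copy_intensity(loud, quiet, sr)
```

This kind of frame-wise matching is why the samples were time-warped to the same timing first: the envelopes only line up if the words do.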
High -> Low
I expected high to low to be the easier of the two, as removing spectral content is easier than adding it.
Low sample spectrogram:
High converted to low spectrogram:
Compared to the high voice, the converted voice sounds significantly softer spoken. The end result is very close to the low voice, but sounds slightly more nasal and clear. The voice sounds slightly robotic due to quality loss during conversion, but the quality is reasonably good.
Low -> High
High sample spectrogram:
Low converted to high spectrogram:
Compared to the low voice, the converted voice sounds much louder. However, compared to the high voice, the converted voice sounds much more breathy and much less clear. Overall, this conversion definitely succeeded in making the original voice sound louder, but the result is still some way from sounding natural.
Formants
I also tested the effects of copying formants using the Praat Vocal Toolkit. As expected, it didn't have much effect on the sound of the voice.
Links and References
[1] "Praat Vocal Toolkit". Praatvocaltoolkit.com. N.p., 2017. Web. 3 Apr. 2017.