Formant Synthesizer-Based Singing Vowel Generation

(2023)

Final project for Digital Sound Synthesis and Audio Processing (Fall 2023)

A synthesis-by-rule approach to real-time singing vowel generation, based on the source-filter model using original formant data.

KEYWORDS:

Vocal Synthesis; Formant Synthesis; Source Filter; Sample Library/Dataset; Max/MSP

This project implements a real-time singing vowel generation algorithm based on source-filter synthesis principles. Static formant parameters (center frequency, bandwidth, amplitude) for the first five formants are extracted with Praat from original recordings of choral singers and applied to a bank of parallel filters. An excitation signal is constructed from an impulse train generator and a noise generator to simulate the voiced (harmonic) and unvoiced (noise) characteristics of the voice, along with other temporal and timbral modulations. The overall pipeline is presented as an interactive Max/MSP application and instrument, followed by analysis and discussion of how to improve model performance.

A demo of the singing vowel generator patch.

Links to explore: Project Paper; Play with the patch

Schematic of glottal source modeling used for this project.

Modeling the Glottal Source

The glottal excitation is modeled with two components: a harmonic ("voiced") component, built around an impulse train generator that provides the periodic excitation, and an inharmonic ("unvoiced") component, built around a noise generator that contributes to voice roughness.

The harmonic component consists of:

An impulse train generator built with the train~ object, generating impulses at the fundamental frequency;
A jitter component built with the rand~ object, introducing random variations to the fundamental frequency at a given rate and deviation strength;
A vibrato module built with the tri~ object, modulating the fundamental frequency at ~6% deviation, with a customizable amplitude envelope triggered at each new note onset or change of syllable.
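
The harmonic component listed above can be sketched outside Max as follows. This is a minimal Python illustration under stated assumptions, not the patch itself: the function name `impulse_train`, the 5 Hz vibrato rate, the 1% jitter depth, the 50 Hz jitter rate, and the linear vibrato fade-in are all hypothetical choices. Only the triangle-wave modulator and the ~6% vibrato deviation come from the description. The jitter here is a stepped random value, which simplifies what rand~ produces.

```python
import random

def impulse_train(f0, duration, sr=44100, vibrato_rate=5.0, vibrato_depth=0.06,
                  vibrato_attack=0.3, jitter_rate=50.0, jitter_depth=0.01, seed=0):
    """Return a list of samples: 1.0 at each glottal pulse, 0.0 elsewhere."""
    rng = random.Random(seed)
    n = int(duration * sr)
    out = [0.0] * n
    phase = 0.0
    jitter = 0.0
    hold = max(1, int(sr / jitter_rate))  # samples between new jitter values
    for i in range(n):
        if i % hold == 0:
            jitter = rng.uniform(-1.0, 1.0)  # stepped random deviation (assumed shape)
        t = i / sr
        # Triangle LFO in [-1, 1], standing in for the tri~ modulator.
        x = (t * vibrato_rate) % 1.0
        tri = 4.0 * abs(x - 0.5) - 1.0
        # Linear fade-in of the vibrato after note onset (assumed envelope shape).
        env = min(1.0, t / vibrato_attack) if vibrato_attack > 0 else 1.0
        f = f0 * (1.0 + vibrato_depth * env * tri + jitter_depth * jitter)
        # Phase accumulator: emit one impulse each time the phase wraps.
        phase += f / sr
        if phase >= 1.0:
            phase -= 1.0
            out[i] = 1.0
    return out

# One second of excitation at E4 (about 329.63 Hz).
pulses = impulse_train(329.63, 1.0)
```

Setting `vibrato_depth` and `jitter_depth` to zero reduces this to a plain periodic impulse train, which makes it easy to hear what each modulation adds.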

The inharmonic component is modeled with a pink noise generator, band-pass filtered between 1 and 4 kHz and periodically activated by the train impulses. Adding this filtered pink noise at varying amplitude to the impulse train can simulate breathiness and whispering effects.
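
A rough Python sketch of this noise path is given here for illustration. Several parts are assumptions rather than details of the patch: pink noise is approximated with the Voss-McCartney method, the 1 to 4 kHz band is realized with a single second-order band-pass section centered at the geometric mean of the band edges, and "activated by the train impulses" is interpreted as an exponentially decaying gate retriggered at each impulse. The names `pink_noise`, `bandpass`, `gated_noise`, and the `decay_ms` parameter are hypothetical.

```python
import math
import random

def pink_noise(n, octaves=8, seed=0):
    """Approximate pink noise: sum of white sources refreshed at octave-spaced rates."""
    rng = random.Random(seed)
    rows = [rng.uniform(-1.0, 1.0) for _ in range(octaves)]
    out = []
    for i in range(n):
        for k in range(octaves):
            if i % (1 << k) == 0:  # row k is refreshed every 2**k samples
                rows[k] = rng.uniform(-1.0, 1.0)
        out.append(sum(rows) / octaves)
    return out

def bandpass(x, f_lo, f_hi, sr):
    """Second-order band-pass (0 dB peak gain) spanning roughly f_lo..f_hi."""
    f0 = math.sqrt(f_lo * f_hi)   # geometric center of the band
    q = f0 / (f_hi - f_lo)        # Q from center frequency and bandwidth
    w0 = 2.0 * math.pi * f0 / sr
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0            # b1 is zero for this filter type
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for s in x:
        v = b0 * s + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, s
        y2, y1 = y1, v
        y.append(v)
    return y

def gated_noise(pulses, sr=44100, decay_ms=2.0, seed=0):
    """Band-passed pink noise, gated by an envelope retriggered at each pulse."""
    noise = bandpass(pink_noise(len(pulses), seed=seed), 1000.0, 4000.0, sr)
    decay = math.exp(-1.0 / (decay_ms * 1e-3 * sr))
    env = 0.0
    out = []
    for p, s in zip(pulses, noise):
        env = 1.0 if p > 0 else env * decay
        out.append(env * s)
    return out
```

Mixing the result with the impulse train at different levels is one way to move between a clean tone and a breathier one; the mixing ratio itself is left to the caller.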

Formant analysis result of a Tenor E4 note on a short A syllable.

Formant Source Filters and Data Extraction

Formant data is extracted from the recordings in the Spheringer dataset, an open-source choir sample library covering nine commonly used syllables across all four voice types spanning 2.5 octaves, multiple registers, dynamics, and articulation techniques. Specifically, the recordings of five basic vowels (A, E, IY, O, U) by a soprano and a tenor on E4 are analyzed. The soprano performs the note with a chest-to-lower-mixed vocal register and mezzo-piano (mp) dynamic; the tenor performs the note with a higher-mixed-to-head vocal register and fortissimo (ff) dynamic.

The recordings are analyzed with Praat (a phonetics analysis program) over sustained, stable periods of the harmonic part of each sung vowel, and the average values over the selected periods are documented. The formant data (center frequency, bandwidth at half amplitude or Q factor, and gain) are used to build a bank of five parallel second-order filters, whose coefficients are calculated and applied with the biquad~ and filtergraph~ objects.
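
The parallel filter bank can be illustrated with a short Python sketch. It assumes each formant is given as (center frequency in Hz, bandwidth in Hz, gain in dB), converts bandwidth to Q as Q = fc / bandwidth, and uses the widely used "cookbook" band-pass biquad with 0 dB peak gain; whether the patch uses this exact filter type is not stated, so treat that as an assumption. The function names are hypothetical, and the values in `PLACEHOLDER_FORMANTS` are round illustrative numbers, not the measurements extracted for this project.

```python
import math

# Illustrative placeholders only: (center Hz, bandwidth Hz, gain dB).
PLACEHOLDER_FORMANTS = [
    (700.0, 80.0, 0.0),
    (1200.0, 90.0, -6.0),
    (2600.0, 120.0, -12.0),
    (3300.0, 130.0, -18.0),
    (4500.0, 140.0, -24.0),
]

def bandpass_coeffs(fc, bw, sr):
    """Normalized band-pass biquad coefficients (b0, b1, b2, a1, a2)."""
    q = fc / bw
    w0 = 2.0 * math.pi * fc / sr
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)

def biquad(x, coeffs):
    """Direct-form I second-order filter."""
    b0, b1, b2, a1, a2 = coeffs
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for s in x:
        v = b0 * s + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, s
        y2, y1 = y1, v
        y.append(v)
    return y

def formant_bank(x, formants, sr=44100):
    """Run x through one band-pass per formant in parallel and sum the results."""
    out = [0.0] * len(x)
    for fc, bw, gain_db in formants:
        g = 10.0 ** (gain_db / 20.0)  # dB to linear amplitude
        for i, v in enumerate(biquad(x, bandpass_coeffs(fc, bw, sr))):
            out[i] += g * v
    return out
```

Feeding the excitation signal into `formant_bank` gives the source-filter chain in miniature. Note that biquad~ names its coefficients differently from the convention used here (its feedforward terms are the a values and its feedback terms the b values), so the tuple order would need remapping before being sent to the object.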

Formant data extracted from the 10 audio samples from the Spheringer dataset.

Evaluation and Analysis

In general, the formant synthesizer based on the source-filter method performs reasonably well in producing the sustain period of the basic vowels, and it performs better on open vowels (A and E) than on close vowels (IY, O, U). This is possibly due to the lack of a consonant generator in the pipeline and the low amplitude of the higher formant filters, which contribute largely to consonant identification. The manner of performance in these recordings may also make it difficult to perceive the vowel characteristics accurately, as choral singers generally aim to sing close vowels more like their open counterparts for a rounded, mellow timbre.

The Max/MSP patch has demonstrated flexibility in real-time implementation of spectral and temporal variations such as jitter, vibrato, and noise, and is convenient in an interactive performance setting.

Future updates should aim to incorporate a consonant generator as well as a semi-automatic, STFT-based analysis stage that precedes the synthesis pipeline and chooses the formant parameters for the bank of filters.
