Polytonia

Sample Polytonia labeling for French
poly_fr_01.wav

The name Polytonia refers to both (1) a notation for transcribing pitch levels and pitch movements in speech, and (2) an algorithm that derives this transcription from the acoustic speech signal. Both aspects are described in detail in Mertens (2013, 2014, 2019). The program itself is available here.

Polytonia's prosodic transcription does not assume a particular phonological theory of intonation, with its set of predefined tonal elements, units and positions (e.g. so-called "pitch accents" and "prosodic unit boundaries") postulated for a given language, or even universally. Rather, it aims to determine the pitch level and movement of each syllable in a manner motivated by tonal perception, so that these (categorised) observations can serve as input to phonological analysis and interpretation. A phonological theory of intonation does require additional (acoustic) information to determine the position of stress ("pitch accent", in autosegmental-metrical approaches) and of prosodic unit boundaries. (See below.)

Sample Polytonia labeling for English
poly_en_01.wav

The notation distinguishes 5 pitch levels: L (low), M (mid), H (high), B (bottom) and T (top). The first three are defined relative to one another; the latter two are defined relative to the pitch range of the speaker. In addition there are 5 pitch movements: R (large rise), F (large fall), r (small rise), f (small fall), and _ (level). Levels and movements may be combined, e.g. HF indicates a large fall starting from a high pitch level.
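The symbol inventory above can be illustrated with a small Python sketch that splits a label such as HF into its level and movement parts. The function and type names (parse_label, Label) are hypothetical, and the assumed label syntax (at most one level symbol, followed by zero or more movement symbols) is a simplification made for this illustration; it is not taken from the Polytonia program itself.

```python
from typing import NamedTuple, Optional, Tuple

# Symbol inventory as listed in the text.
LEVELS = {"L": "low", "M": "mid", "H": "high", "B": "bottom", "T": "top"}
MOVEMENTS = {
    "R": "large rise",
    "F": "large fall",
    "r": "small rise",
    "f": "small fall",
    "_": "level",
}


class Label(NamedTuple):
    level: Optional[str]        # one of LEVELS, or None if absent
    movements: Tuple[str, ...]  # zero or more symbols from MOVEMENTS


def parse_label(label: str) -> Label:
    """Split a label into an optional level and its movement symbols.

    Assumed syntax (hypothetical): [level] movement*
    """
    if not label:
        raise ValueError("empty label")
    level = None
    rest = label
    if rest[0] in LEVELS:
        level, rest = rest[0], rest[1:]
    for symbol in rest:
        if symbol not in MOVEMENTS:
            raise ValueError(f"unknown symbol {symbol!r} in label {label!r}")
    return Label(level, tuple(rest))


def describe(label: str) -> str:
    """Return a plain-English gloss of a label."""
    parsed = parse_label(label)
    parts = [MOVEMENTS[m] for m in parsed.movements]
    text = ", then ".join(parts) if parts else "no movement"
    if parsed.level is not None:
        text += f", starting from {LEVELS[parsed.level]} level"
    return text


print(describe("HF"))
```

With these assumptions, the label HF is glossed as a large fall starting from the high level, which matches the example given in the text.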

The algorithm is based on the prosodic features of the syllables, obtained by segmenting the speech continuum into syllables, by stylization of the F0 curve, and by obtaining the pitch range of the speaker. All these processing steps are readily available in the Prosogram system, on which Polytonia builds.

The quality of the resulting tonal transcription depends on the accuracy of F0 measurement, segmentation, stylization, speaker turn identification, pitch range estimation, hesitation detection, and so on, and ultimately on the recording conditions.

In Polytonia, pitch level assignment is primarily based on the detection of local pitch changes in speech, either within a given syllable, or in its local left context. A secondary source of information is pitch range, which serves as an indicator of pitch level in the absence of pitch change, and which determines the pitch intervals used to categorize pitch changes.
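As a rough illustration of how a pitch change can be turned into one of the movement categories, the sketch below converts two F0 values to an interval in semitones and compares it with two thresholds. The semitone formula is standard; the function names and the default threshold values (1.5 and 4.5 ST) are placeholders chosen for this example only. In Polytonia the intervals are derived from the speaker's pitch range, so a real implementation would compute them per speaker rather than fix them.

```python
import math


def semitones(f_start_hz: float, f_end_hz: float) -> float:
    """Signed interval in semitones (ST) between two F0 values in Hz."""
    return 12.0 * math.log2(f_end_hz / f_start_hz)


def categorize_movement(interval_st: float,
                        small_st: float = 1.5,
                        large_st: float = 4.5) -> str:
    """Map a signed interval in ST to one of R, F, r, f or _.

    small_st and large_st are hypothetical thresholds, not the values
    used by Polytonia.
    """
    size = abs(interval_st)
    if size < small_st:
        return "_"                      # too small to count as a change
    rising = interval_st > 0
    if size < large_st:
        return "r" if rising else "f"   # small rise or small fall
    return "R" if rising else "F"       # large rise or large fall


# One octave up (100 Hz to 200 Hz) is 12 ST, a large rise here.
print(categorize_movement(semitones(100.0, 200.0)))
```

Passing the thresholds as parameters keeps the categorization separate from pitch range estimation, so the same function can be reused once speaker-specific intervals are available.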

Sample Polytonia labeling for French
poly_fr_02.wav

Distinctive features of Polytonia (compared to other approaches):

    1. pitch levels are synchronized with syllabic nuclei, more specifically with the vowel onset;
    2. pitch movements may be associated with syllables, more specifically with the syllable rhyme (nucleus and coda). In this way pitch movements can be transcribed, both those within a single syllable (glissando) and those extending over a sequence of syllables.

Sample Polytonia labeling for Greek
poly_gr_01.wav

Issues

Should Polytonia be considered as an autosegmental approach?

A given pitch contour may combine with phrases of various lengths, ranging from one syllable to many, for instance when we pronounce each of the names "John", "Mary", "Christopher", and "Alexander" with the same focus intonation in reply to a question such as "Who did you see?". The key insight of the autosegmental analysis of prosody (already present in many structuralist analyses of intonation) is this: the prosodic (suprasegmental) and segmental layers of speech use units and domains of their own, and the two types of units do not necessarily coincide. It suffices that both layers are correctly associated. This association uses prosodic positions, such as accent and prosodic boundaries. A "pitch accent tone" is associated with a word accent position (lexical stress); a "boundary tone" with a prosodic boundary position. Between the tones in these successive positions, pitch is "interpolated" (connected), forming a continuous curve, taking into account the physiology and acoustics of the vocal folds (limitations on pitch change speed, micro-prosodic phenomena).

Clearly, Polytonia is not an autosegmental transcription: it doesn't define prosodic positions, but rather assigns pitch levels and movements to as many syllables as possible. This choice is motivated by two main considerations. First, automatic autosegmental labeling requires the prior detection of stress position (accent, if you prefer) and prosodic boundaries, using acoustic information (rather than manual annotation), something which appears to be particularly hard. Second, the acoustic correlates of stress and prosodic boundaries are likely to be language-specific.

However, once the prosodic positions in utterances become available, the mapping of Polytonia onto an autosegmental transcription becomes straightforward. Whatever the prosodic domain, prosodic positions, including boundaries of prosodic units, coincide with a syllable: stress is located on a particular syllable (which is more prominent than the adjacent syllables), prosodic boundary tones are located on the syllable adjacent to these boundaries, and multiple tones can coincide within the same syllable, albeit in a given order (for instance, a one-syllable utterance such as "Yes" will combine a stress tone and a final boundary tone, in that order).

Polytonia and other prosodic labeling systems

INTSINT notation & the Momel system

The INTSINT symbols (Hirst & Di Cristo, 1998), namely T (top), M (mid), B (bottom), H (higher), S (same), L (lower), U (upstepped), and D (downstepped), are defined differently from those of Polytonia. In INTSINT all movements are represented as sequences of levels.

The Momel system (Hirst & Espesser, 1993; Hirst, Di Cristo & Espesser, 2000) models the raw pitch data (by a quadratic spline function) as a very smooth pitch curve (interpolating over unvoiced portions and silent pauses), and locates pitch targets along the time axis, corresponding to turning points (changes in direction) of the curve. Their position depends entirely on the shape of the modelled pitch curve, rather than on phonetic/spectral information, syllable structure or loudness. When the modelled curve changes direction during a pause, for instance, Momel's pitch target will be located there. As INTSINT symbols are associated with these targets, it is impossible to represent short-term pitch changes, such as a glissando on a syllable, unless their shape is in line with that of the context.
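To make the notion of a turning point concrete, here is a minimal sketch that finds direction changes in a sampled curve. It is not the Momel algorithm (which fits a quadratic spline to the raw pitch data); the function name turning_points and the input, a plain list of pitch values taken from an already smoothed curve, are assumptions made for this illustration. Note that the function sees only the curve values, with no access to syllable boundaries or pauses, which is the property discussed above.

```python
from typing import List, Sequence


def turning_points(curve: Sequence[float]) -> List[int]:
    """Return indices where a sampled curve changes direction.

    For each change from rising to falling (or the reverse), the index of
    the last sample before the new direction starts is reported. Flat
    stretches are skipped and do not count as a direction of their own.
    """
    points: List[int] = []
    previous_direction = 0  # +1 rising, -1 falling, 0 not yet known
    for i in range(1, len(curve)):
        delta = curve[i] - curve[i - 1]
        direction = (delta > 0) - (delta < 0)
        if direction == 0:
            continue
        if previous_direction != 0 and direction != previous_direction:
            points.append(i - 1)
        previous_direction = direction
    return points


# A rise, then a fall, then a rise: the peak and the valley are reported.
print(turning_points([1.0, 3.0, 5.0, 4.0, 2.0, 3.0]))
```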

PROSOTRAN

The PROSOTRAN system (Bartkova et al., 2012) computes a symbolic transcription for duration, pitch and intensity, essentially using descriptive statistics (median, standard deviation, etc.) for these parameters.

For duration, syllabic nucleus duration is compared to mean duration. For pitch level, the pitch range (in ST) is divided into 6 parts, which remain constant throughout the utterance (or even the recording). Notice that such globally defined pitch levels are unable to deal with the pitch declination frequently observed as subglottal pressure decreases. Four pitch movement categories are identified: falling, rising, falling-rising and rising-falling. Pitch slopes are categorized relative to the glissando threshold. For intensity, vowel intensity (normalized for vowel aperture) is compared to the mean intensity within the interpausal interval.
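The effect of globally fixed pitch levels can be sketched as follows. The code divides a pitch range (in ST) into 6 bands and returns the band number for a given pitch value. The assumption that the bands have equal width, the function name global_level, and the example values are all hypothetical; this is not PROSOTRAN's implementation.

```python
def global_level(pitch_st: float,
                 range_bottom_st: float,
                 range_top_st: float,
                 n_levels: int = 6) -> int:
    """Return a level from 1 (lowest) to n_levels (highest).

    The range is split into n_levels bands of equal width (an assumption
    made for this sketch). Values outside the range are clamped to the
    nearest band.
    """
    if range_top_st <= range_bottom_st:
        raise ValueError("range_top_st must exceed range_bottom_st")
    width = (range_top_st - range_bottom_st) / n_levels
    index = int((pitch_st - range_bottom_st) // width)
    return min(max(index, 0), n_levels - 1) + 1


# Two peaks in a 12 ST range: one early in the utterance at 9 ST above
# the bottom of the range, one late at 5 ST because of declination.
print(global_level(9.0, 0.0, 12.0), global_level(5.0, 0.0, 12.0))
```

In this example the early peak falls in band 5 and the late peak in band 3, even if both are comparable peaks relative to their local context, which illustrates the declination problem noted above.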