Prosogram 3.05 + Polytonia
Pitch contour stylization based on a tonal perception model
Polytonia prosodic labeling of pitch levels and pitch movements
by Piet Mertens
Prosogram is a tool for the analysis and transcription of pitch variations in speech. Its stylization simulates the auditory perception of pitch by the listener. A key element in tonal perception is the segmentation of speech into syllable-sized elements, resulting from spectral change (sound timbre) and intensity variation.
The tool also provides measurements of prosodic features for individual syllables (such as duration, pitch, pitch movement direction and size), as well as prosodic properties of longer stretches of speech pronounced by a given speaker (such as speech rate, proportion of silent pauses, pitch range, and pitch trajectory).
The tool can easily interact with other software tools. It is used as a first step in the automatic phonological transcription of intonation and in the detection of sentence stress and intonation boundaries.
Prosogram features
pitch stylization based on a model of tonal perception,
automatic acoustic segmentation of speech into syllable-sized elements,
alternatively, segmentation into rhymes, syllables or vowels, starting from phonetic and/or syllabic alignment in an annotation TextGrid,
two-pass F0 determination with automatic adjustment of the F0 detection range, or a user-selected F0 range (see the sketch following this list),
pitch range estimation per speaker (uses speech turn labeling),
prosodic profile per speaker, including pitch range, overall average pitch (median), pitch variability measures (F0 histogram, trajectory, proportion of level nuclei, histogram of glissandos up and down, histogram of inter-syllable pitch movements), speech rate, proportion of silent pauses,
drawings (prosograms) of pitch stylization together with user-selected tiers from annotation TextGrid, with/without acoustic parameters such as F0, intensity, voicing, pause, with/without pitch range, in many graphics file formats,
Polytonia prosodic labeling of pitch levels and pitch movements,
interactive viewing of stylization with playback, resynthesis, scrolling and zooming, optionally displaying pitch range, pitch targets in Hz or ST, and user-selected annotation tiers,
pitch normalization based on speaker's pitch range,
output table with numerous prosodic variables per syllabic nucleus (pitch: mean, median, high, low, start, end, pitch interval, glissando (up, down, none); duration of nucleus, rhyme, vowel, syllable; peak intensity; pause; speaker label),
prosodic profile output table (with rows per speaker and per file), for export to statistical analysis software,
batch processing of large-scale speech corpora, with folder management and provisions for on-the-fly (run-time) corpus annotation conversion,
saving automatic segmentation into syllables and syllabic nuclei to a TextGrid file for validation and editing
validation of phonetic and syllabic tiers in annotation TextGrid
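To make the two-pass F0 determination listed above more concrete, here is a minimal sketch using Parselmouth, a Python interface to Praat. It is not Prosogram's code: the file name is a placeholder, and the quantile-based adjustment rule is one common heuristic, assumed here for illustration rather than taken from the Prosogram script.

```python
# Minimal sketch of two-pass F0 determination (not Prosogram's actual code).
# Assumes Parselmouth (Python interface to Praat); "example.wav" is a placeholder.
import numpy as np
import parselmouth

snd = parselmouth.Sound("example.wav")

# Pass 1: wide default detection range.
pitch1 = snd.to_pitch(pitch_floor=60.0, pitch_ceiling=600.0)
f0 = pitch1.selected_array['frequency']
f0 = f0[f0 > 0]                                  # keep voiced frames only

# Adjust the detection range from the distribution of pass-1 values.
# The quantile rule below is a common heuristic, assumed here for illustration.
q1, q3 = np.percentile(f0, [25, 75])
floor, ceiling = 0.75 * q1, 1.5 * q3

# Pass 2: analysis with the speaker-adapted range.
pitch2 = snd.to_pitch(pitch_floor=floor, pitch_ceiling=ceiling)
print(round(floor), round(ceiling))
```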
Illustrations of Prosogram
The first illustration shows a light Prosogram with the stylization (black lines) and the pitch range (red horizontal lines indicating top, median and bottom). The annotations of sounds, syllables and words are provided by the corpus.
Wide, light, with pitch range
The next illustration shows a rich Prosogram, which adds the parameters of F0 (blue line), intensity (green line), and voicing (sawtooth), as well as the segmentation (red boxes), and the calibration of X and Y axes (in ST relative to 1 Hz, and in Hz). The vertical dotted lines correspond to the segmentation boundaries in the annotation.
Wide, rich
The third illustration shows a light Prosogram, in a more compact size.
Compact, light
The next figure shows a Prosogram using automatic segmentation into syllable-sized units. The magenta curve shows the intensity of the band-pass filtered speech signal, on which this segmentation is based.
Automatic segmentation
The last figure shows the screen of the interactive Prosogram. Here the user can interactively browse the speech signal and its stylization, play back parts (syllables, words...), and resynthesize the signal with the stylized pitch. (The tonal annotation in tier "polytonia" is obtained using Polytonia analysis.)
Interactive Prosogram window
Introduction
Rationale
Many phoneticians use the fundamental frequency (F0) curve to represent pitch contours in speech. F0 is an acoustic parameter; it provides useful information about the acoustic properties of the speech signal. But it certainly is not the most accurate representation of the intonation contour as it is perceived by human listeners.
In the 1970s, pitch contour stylization was introduced as a way to reduce the F0 curve to those aspects which are potentially relevant for speech communication. The approach originates from work by J. 't Hart and R. Collier at the I.P.O. (Institute for Perception Research) in Eindhoven ('t Hart et al. 1990), and was further developed by D. Hermes in the 1980s and 1990s (Hermes 2006). Other types of stylization have been proposed, such as the Momel system (Hirst & Espesser 1993; Hirst, Di Cristo & Espesser 2000). However, most of these stylization approaches are based on statistical or mathematical properties of the F0 data and ignore the facts of pitch perception.
It is well known that the auditory perception of pitch variations depends on many factors other than the F0 variation itself. In 1995 a stylization based on a simulation of tonal perception was proposed by Ch. d'Alessandro and P. Mertens (d'Alessandro & Mertens 1995; Mertens & d'Alessandro 1995). The purpose of this stylization is to provide a representation which approximates the image in the listener's auditory memory. The tonal perception model was validated in listening experiments with stimuli resynthesized from the stylized contour (Mertens et al. 1997).
This approach may be used to obtain a low-level transcription of pitch levels and pitch movements. It requires a segmentation of the speech signal into syllable-sized units, motivated by phonetic, acoustic or perceptual properties. Various types of alignment may be obtained manually or automatically, and are stored in an annotation file (Praat's TextGrid file format). The Prosogram can use various types of segmentation:
an automatic segmentation into local peaks of intensity (both that of the band-pass filtered speech signal and that of the full band signal);
a phonetic alignment of speech sounds (or, alternatively, of vowels only);
a syllable alignment;
a syllable rhyme alignment;
a segmentation provided by an external program.
The stylization is applied to the F0 curve of those segmented units (vowels, rhymes, syllables), which are approximations of the more sonorous part of the syllable.
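As a toy illustration of how such an alignment is used, the sketch below selects the vowel interval of each syllable from a phone tier; the tier contents and the vowel inventory are invented for the example and do not come from a particular corpus.

```python
# Toy sketch: select the vowel interval of each syllable from a phone tier.
# The tier contents (times in seconds, label) and the vowel set are invented.
phone_tier = [
    (0.00, 0.08, "s"), (0.08, 0.21, "a"), (0.21, 0.30, "l"),
    (0.30, 0.43, "y"), (0.43, 0.55, "t"), (0.55, 0.70, "e"),
]
VOWELS = {"a", "e", "i", "o", "u", "y"}

# Intervals the stylization will operate on (approximations of the syllabic nuclei).
nuclei = [(start, end) for (start, end, label) in phone_tier if label in VOWELS]
print(nuclei)   # [(0.08, 0.21), (0.3, 0.43), (0.55, 0.7)]
```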
How does it work?
The system involves several processing steps (a schematic sketch in Python follows this list).
Calculate the acoustic parameters: F0, (full bandwidth) intensity, intensity of band-pass filtered speech, voicing (V/UV).
Obtain a segmentation into units of the type indicated above (e.g. vowel, rhyme, syllable, voiced portion). Within each segment, select the voiced portion with sufficiently high intensity (using intensity difference thresholds relative to the local peak). This results in an estimate of the syllabic nucleus.
Detect silent pauses.
Stylize the F0 of the selected time intervals, i.e. syllabic nuclei.
Compute prosodic features for each syllable: minimum, maximum, mean, and median pitch, pitch at start and end of nucleus, intrasyllabic pitch change (in ST), intersyllabic pitch change, upward and downward pitch change, pitch trajectory, pitch-range normalized pitch, nucleus duration, nucleus start and end time, pause duration, etc.
Determine the pitch range used by the speaker. It will be used for speaker-dependent pitch normalisation and for phonological pitch level analysis (in Polytonia).
Plot the result: parameters, stylized pitch, segmentation, and some annotation tiers (text, phonetic transcription, etc.).
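The sketch below renders these steps schematically for a single segmented unit, using synthetic data in place of the actual acoustic analyses. The intensity threshold and all numeric values are illustrative, and the real Prosogram is a Praat script, not this Python code.

```python
# Schematic sketch of the per-nucleus processing, for one segmented unit.
# Synthetic data stand in for the acoustic analyses; thresholds are illustrative,
# and the real Prosogram is implemented as a Praat script, not as this code.
import numpy as np

dt = 0.01                                               # analysis frame step (s)
t = np.arange(0.0, 0.30, dt)                            # one segmented unit of 300 ms
f0_hz = np.linspace(180.0, 220.0, t.size)               # toy F0 track (Hz), fully voiced
intensity_db = 60 + 20 * np.sin(np.pi * t / t.max())    # toy intensity contour (dB)

# Select the syllabic nucleus: the part whose intensity stays within a fixed
# difference (here 3 dB, illustrative) of the local intensity peak.
in_nucleus = intensity_db >= intensity_db.max() - 3.0
nucleus_st = 12 * np.log2(f0_hz[in_nucleus])            # pitch in ST relative to 1 Hz

# A subset of the per-syllable features listed above.
features = {
    "f0_mean_st": nucleus_st.mean(),
    "f0_median_st": np.median(nucleus_st),
    "f0_start_st": nucleus_st[0],
    "f0_end_st": nucleus_st[-1],
    "intrasyllabic_change_st": nucleus_st[-1] - nucleus_st[0],
    "nucleus_duration_s": in_nucleus.sum() * dt,
}
print(features)
```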
How is it implemented?
The system is implemented as a Praat script. Praat is a tool for acoustic and phonetic research, written by Paul Boersma and David Weenink of the Institute of Phonetic Sciences in Amsterdam. The choice of Praat is motivated by the fact that it is powerful, user-friendly, programmable, freely available, runs on many platforms, and is actively maintained.
How to obtain the phonetic segmentation?
A suitable segmentation can be obtained in various ways.
Automatic segmentation
This type of segmentation does not require a preliminary segmentation into sounds or syllables; it is based on the acoustic signal only. It detects the local peaks in the intensity of the band-pass filtered speech signal, and adjusts their boundaries using the full-band intensity, voicing and F0 discontinuities. (An earlier implementation used loudness, computed from the cochleagram; however, this parameter introduces a smoothing which masks syllable boundaries at sonorous segments, such as nasals and glides.) The automatic segmentation does not identify the individual speech sounds, but only the central part of the syllable, referred to as the syllabic nucleus.
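A rough outline of this kind of peak-based detection is sketched below, assuming the intensity contour of the band-pass filtered signal is already available as an array. The contour, the peak-picking parameters and the simple boundary rule are invented for illustration; they do not reproduce Prosogram's boundary adjustment using full-band intensity, voicing and F0 discontinuities.

```python
# Rough outline of peak-based nucleus detection on a band-pass intensity contour.
# The contour, the thresholds and the simple boundary rule are invented for
# illustration; Prosogram itself also uses full-band intensity, voicing and F0.
import numpy as np
from scipy.signal import find_peaks

dt = 0.01                                        # frame step (s)
t = np.arange(0.0, 1.5, dt)
bp_intensity = 45 + 15 * np.maximum(0.0, np.sin(2 * np.pi * 2 * t)) ** 2   # 3 "syllables"

# Candidate syllabic nuclei = local intensity peaks with a minimal prominence
# (3 dB) and a minimal distance between peaks (100 ms); values are illustrative.
peaks, _ = find_peaks(bp_intensity, prominence=3.0, distance=int(0.1 / dt))

# Nucleus boundaries: where the contour drops a fixed amount (9 dB) below the peak.
nuclei = []
for p in peaks:
    left, right = p, p
    while left > 0 and bp_intensity[left - 1] > bp_intensity[p] - 9.0:
        left -= 1
    while right < t.size - 1 and bp_intensity[right + 1] > bp_intensity[p] - 9.0:
        right += 1
    nuclei.append((round(t[left], 2), round(t[right], 2)))
print(nuclei)                                    # three (start, end) pairs
```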
Manual segmentation
A manual segmentation can be created interactively using Praat and is stored in a TextGrid file.
Semi-automatic segmentation using automatic phonetic alignment
The segmentation can be obtained semi-automatically, using automatic alignment between the speech signal and a phonetic transcription of the utterance, based either on automatic speech recognition or on speech synthesis. The phonetic transcription can be obtained either manually or using grapheme-to-phoneme conversion and natural language processing. In the latter case, an orthographic transcription of the words in the utterance is required.
Several tools for automatic alignment are available.
Train & Align, by S. Brognaux et al. (Louvain-la-Neuve, Mons)
EasyAlign, by Jean-Philippe Goldman (Genève)
SPPAS, by B. Bigi (Aix-en-Provence)
MAUS, by Fl. Schiel (Munich)
Illustrations
A small corpus of spoken French was processed to illustrate the results obtained with the transcription tool. The corpus consists of about 4 minutes of an interview between Fayard and Benoîte Groult broadcast on Radio de la Suisse Romande.
Audio files
Transcriptions (Prosograms) (In Acrobat PDF format. When printing, use "Page Scaling: None")
PSOLA resynthesis from the stylized pitch contour.
A closer look at tonal perception and stylization
Some F0 variations are clearly perceived as rises or falls; others go unnoticed except after repeated listening; still others are simply not perceived at all. Indeed, tonal perception depends upon several factors.
The auditory threshold for pitch variation, or glissando threshold G, specifies the minimal pitch interval required for a pitch variation of a given duration to be perceived as a changing pitch, rather than as a level tone. This threshold depends on the size and duration of the F0 variation. Since the work of J. 't Hart (1974, 1976), it is usually expressed in ST/s (semitones per second). In hearing experiments using short stimuli, either pure tones or speech-like signals, with repeated presentations, a threshold G = 0.16/T² was measured, where T is the duration of the variation in seconds. (A small numeric sketch of this threshold is given below, after the remaining factors.)
Major changes in the spectral properties of the signal tend to function as boundaries (House 1990), breaking up a voiced continuum into a sequence of smaller parts corresponding to syllabic nuclei.
Major changes in signal amplitude tend to function as boundaries.
The presence of a pause following the F0 variation lowers the threshold for the perception of that variation (House 1995).
A change in slope is perceived provided it is sufficiently large. This is called the differential glissando threshold DG (d'Alessandro & Mertens 1995).
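To see what the glissando threshold does in practice, the sketch below checks whether a pitch movement of a given size and duration would be perceived as a glide or as a level tone under G = 0.16/T²; the frequencies and durations are invented for illustration.

```python
# Is an F0 movement perceived as a glide or as a level tone?
# Illustration of the glissando threshold G = 0.16/T^2 (in ST/s); values invented.
import math

def is_glide(f0_start_hz, f0_end_hz, duration_s, a=0.16):
    """True if the movement's rate of change exceeds the threshold a / T**2."""
    interval_st = 12 * math.log2(f0_end_hz / f0_start_hz)   # size in semitones
    rate_st_per_s = abs(interval_st) / duration_s
    threshold_st_per_s = a / duration_s ** 2
    return rate_st_per_s >= threshold_st_per_s

rise_2st = 200 * 2 ** (2 / 12)          # end frequency of a 2 ST rise from 200 Hz
print(is_glide(200, rise_2st, 0.05))    # 40 ST/s vs threshold 64 ST/s   -> False (level)
print(is_glide(200, rise_2st, 0.15))    # 13.3 ST/s vs threshold 7.1 ST/s -> True (glide)
```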
Our approach to pitch contour stylization takes into account:
the segmentation into syllabic nuclei (the high-intensity region within the rhyme), due to spectral and amplitude changes,
the glissando threshold (G),
the effect of pause presence on the glissando threshold (G1, G2),
the differential glissando threshold (DG),
the minimal duration (dmin) for a plateau in a complex pitch movement.
The stylization shows the effect of a change of the model parameters on the estimated perceived pitch contour. This is shown in the next sample, which compares the F0 curve and two stylization variants: the first with G=0.16/T², the second with G=0.32/T², i.e. a glissando threshold twice as high. The (intravocalic) pitch movements found on "chefs" and "gieux" in the case of G=0.16/T² no longer appear in the stylization with G=0.32/T².
In speech communication, utterances are heard just once: there is no time for the listener to reflect on the auditory properties of the signal. This situation differs from that of a hearing experiment, where a stimulus is usually repeated several times, with long silent pauses between presentations. How, then, should the glissando parameter be chosen in order to obtain a correct representation of pitch perception in continuous speech? By resynthesizing utterances (TD-PSOLA) with F0 stylizations obtained for alternative settings of the above thresholds, and presenting them to listeners together with a resynthesis of the original utterance, one can determine the glissando threshold for which listeners are unable to distinguish the stylized pitch contour from the original one. The setting G=0.32/T² matches the performance of listeners in continuous speech. To take into account the impact of a silent pause on the perception of the preceding pitch movement, the glissando threshold may be adjusted dynamically, depending on the presence of a pause. To obtain this behaviour, an adaptive glissando threshold (G=0.16-0.32/T²) is used, where the low threshold applies before a silent pause and the high one elsewhere.
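A minimal sketch of this adaptive selection is given below. The coefficients 0.16 and 0.32 come from the text above, but the criterion used to decide that a pause follows (0.3 s of silence) is an assumption made for the example.

```python
# Adaptive glissando threshold: the lower (more sensitive) coefficient before a
# silent pause, the higher one elsewhere. Coefficients 0.16 and 0.32 come from the
# text above; the pause criterion (>= 0.3 s of silence) is an assumption.
def glissando_threshold(nucleus_duration_s, pause_after_s,
                        low=0.16, high=0.32, min_pause_s=0.3):
    """Threshold (in ST/s) applied to the pitch movement of one syllabic nucleus."""
    coefficient = low if pause_after_s >= min_pause_s else high
    return coefficient / nucleus_duration_s ** 2

print(glissando_threshold(0.15, pause_after_s=0.5))   # pre-pausal:  ~7.1 ST/s
print(glissando_threshold(0.15, pause_after_s=0.0))   # elsewhere:  ~14.2 ST/s
```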
Application to automatic transcription of intonation
The stylization by Prosogram has been used for automatic transcription of pitch contours and intonation.
A first type, called Polytonia (Mertens, 2014), indicates the pitch level and pitch movement of each syllable. Pitch levels are determined on the basis of the speaker's pitch range, of pitch intervals in the local context of the syllable, and of pitch intervals within the syllable.
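As a rough illustration only, and not Polytonia's actual decision procedure, a nucleus pitch could be placed within the speaker's range as in the sketch below; the three-way division of the range and the example values are invented.

```python
# Toy illustration of placing a nucleus pitch within the speaker's pitch range.
# This is NOT Polytonia's decision procedure; the three-way split is invented.
import math

def pitch_level(f0_hz, range_bottom_hz, range_top_hz):
    """Map a pitch value to a coarse level (L/M/H) relative to the speaker's range."""
    st = lambda f: 12 * math.log2(f)                       # semitones re 1 Hz
    position = (st(f0_hz) - st(range_bottom_hz)) / (st(range_top_hz) - st(range_bottom_hz))
    if position < 1 / 3:
        return "L"
    if position < 2 / 3:
        return "M"
    return "H"

print(pitch_level(120, range_bottom_hz=100, range_top_hz=250))   # 'L'
print(pitch_level(220, range_bottom_hz=100, range_top_hz=250))   # 'H'
```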
A second type identifies positions in prosodic structure, such as stressed syllable, pre-stress syllable, and prosodic boundary, and reinterprets Polytonia's pitch levels and movements in terms of such positions. This approach is called ToPPos (for Tones on Prosodic Positions) (Mertens, to appear).
References
Publications on the Prosogram
These papers are available on ResearchGate and/or Academia.
Mertens, Piet (2022) The Prosogram model for pitch stylization and its applications in intonation transcription. in Barnes, J.A. and Shattuck-Hufnagel, S. (eds) (2022) Prosodic Theory and Practice. Cambridge, MA: MIT Press. 259-286. ISBN 978-0-262-54317-0.
Mertens, Piet (2019) From pitch stylization to automatic tonal annotation of speech corpora. in Lacheret-Dujour, Anne; Kahane, Sylvain; Pietrandrea, Paola (eds) (2019) Rhapsodie. A prosodic and syntactic treebank for spoken French. Studies in Corpus Linguistics 89. Amsterdam: Benjamins, pp. 233-250. ISBN 978-90-272-0220-8
Mertens, Piet (2014) Polytonia: a system for the automatic transcription of tonal aspects in speech corpora. Journal of Speech Sciences 4 (2), 17-57.
Mertens, Piet (2004) Un outil pour la transcription de la prosodie dans les corpus oraux. Traitement Automatique des langues 45 (2), 109-130.
Mertens, Piet (2004) The Prosogram : Semi-Automatic Transcription of Prosody based on a Tonal Perception Model. in B. Bel & I. Marlien (eds.) Proceedings of Speech Prosody 2004, Nara (Japan), 23-26 March. (ISBN 2-9518233-1-2)
Mertens, Piet (2004) Le prosogramme : une transcription semi-automatique de la prosodie. Cahiers de l'Institut de Linguistique de Louvain 30, 1-3, 7-25
Other references
Most of these are available on ResearchGate and Academia.
Alessandro, C. d'; Mertens, P. (1995) Automatic pitch contour stylization using a model of tonal perception. Computer Speech and Language 9(3), 257-288.
Bardiaux, Alice & Mertens, Piet (2014) Normalisation des contours intonatifs et étude de la variation régionale en français. Nouveaux cahiers de linguistique française 31, 273-284.
Bartkova, Katarina; Delais-Roussarie, Elisabeth; Santiago-Vargas, Fabian (2012) PROSOTRAN: a tool to annotate prosodically non-standard data. Speech Prosody 2012.
Campione, Estelle & Véronis, Jean (2001) Etiquetage prosodique semi-automatique des corpus oraux. Actes TALN 2001, 123-132
Hart, J. 't (1974) Discriminability of the size of pitch movements in speech. I.P.O. Annual Progress Report 9, 56-63.
Hart, J. 't (1976) Psychoacoustic backgrounds of pitch contour stylisation. I.P.O. Annual Progress Report 11, 11-19.
Hart, J. 't (1979a) Explorations in automatic stylization of F0 curves. IPO-APR 14, 61-65.
Hart, J. 't (1981) Differential sensitivity to pitch distance, particularly in speech. JASA 69(3), 811-821.
Hart, J. 't; Collier, R. & Cohen, A. (1990) A perceptual study of intonation. Cambridge Studies in Speech Science and Communication. Cambridge: Cambridge Univ. Press, 227 pp.
Hermes, D.J. (1987) Vowel-onset detection. IPO-APR 22, 15-24.
Hermes, D. (2006) Stylization of pitch contours. in: Sudhoff, Stefan; Lenertová, Denisa; Meyer, Roland; Pappert, Sandra; Augurzky, Petra; Mleinek, Ina; Richter, Nicole; Schließer, Johannes (eds) (2006) Methods in Empirical Prosody Research. De Gruyter. pp. 29-61.
Hermes, D.J. & Gestel, J.C. van (1991) The frequency scale of speech intonation. JASA 90(1), 97-102
Hirst, D. and Espesser, R. (1993) Automatic Modelling of Fundamental Frequency Using a Quadratic Spline Function. Travaux de l'Institut de Phonétique d'Aix-en-Provence 15, 75-85.
Hirst, Daniel J. & Di Cristo, Albert (1998) A survey of intonation systems. in: Hirst, Daniel J. & Di Cristo, Albert (eds.) (1998) Intonation Systems. A Survey of Twenty Languages. Cambridge: Cambridge University Press. 1-44.
Hirst, D.J.; Di Cristo, A.; Espesser, R. (2000) Levels of representation and levels of analysis for intonation. in Horne, Merle (ed) Prosody: Theory and Experiment. pp. 51-87. Kluwer Academic Publishers
House, David (1990) Tonal Perception in Speech. Lund: Lund Univ. Press.
House, David (1995) The influence of silence on perceiving the preceding tonal contour. Proc. Int. Congr. Phonetic Sciences 13, vol. 1, 122-125 (Stockholm 1995)
Mertens, P. (1987a) L'intonation du français. De la description linguistique à la reconnaissance automatique. Unpublished Ph.D. (University of Leuven, Belgium), 2 vol., pp. 317 + 90.
Mertens, P. (1987b) Automatic segmentation of speech into syllables. Proceedings of the European Conference on Speech Technology. Edinburgh. Laver & Jack (eds) (1987), vol. II, 9-12.
Mertens, P. (1989) Automatic recognition of intonation in French and Dutch. Eurospeech 89, vol 1, 46-50 (Paris, September 1989, J.P. Tubach & J.J. Mariani (editors))
Mertens, Piet (2013) Automatic labelling of pitch levels and pitch movements in speech corpora. in Bigi, Brigitte & Hirst, Daniel (2013) Proceedings TRASP 2013, Tools and Resources for the Analysis of Speech Prosody (Aix-en-Provence, August 30, 2013), pp. 42-46. ISBN 978-2-7466-6443-2.
Mertens, P. & Alessandro, Ch. d' (1995) Pitch contour stylization using a tonal perception model. Proc. Int. Congr. Phonetic Sciences 13, 4, 228-231 (Stockholm 1995)
Mertens, P.; Beaugendre, F. & Alessandro, Ch. d' (1997) Comparing approaches to pitch contour stylization for speech synthesis. in Santen, J.P.H. van; Sproat, Richard W.; Olive, Joseph P. & Hirschberg, Julia (eds) (1997) Progress in Speech Synthesis. p. 347-363. N.Y.: Springer Verlag
Rietveld, A.C.M. (1984) Syllaben, klemtonen en de automatische detectie van beklemtoonde syllaben in het Nederlands. Ph. D., University of Nijmegen, 262pp.
Rossi, M. (1971a) Le seuil de glissando ou seuil de perception des variations tonales pour la parole. Phonetica 23, 1-33
Rossi, M. (1978a) La perception des glissandos descendants dans les contours prosodiques. Phonetica 35(1), 11-40
Rossi, M. (1978b) The perception of non-repetitive intensity glides on vowels. Journal of Phonetics 6(1), 9-18
Rossi, M. (1978c) Interactions of intensity glides and frequency glissandos. Language & Speech 21, 384-396
Rossi, M. (1979) Les configurations et l'interaction des pentes de F0 et de I. PICPS 9(1), 246(A)
Rossi, M.; Di Cristo, A.; Hirst, D.; Martin, Ph. & Nishinuma, Y. (1981) L'intonation. De l'acoustique à la sémantique. Paris: Klincksieck, 364 pp.
Spaai, G.W.G.; Storm, A.; Derksen, A.S.; Hermes, D.J. & Gigi, E.F. (1993) An Intonation Meter for teaching intonation to profoundly deaf persons. IPO Manuscript no. 968.
Applications
Patel, A.D.; Iversen, J.R.; & Rosenberg, J.C. (2006) Comparing the rhythm and melody of speech and music: The case of British English and French. Journal of the Acoustical Society of America 119, 3034-3047.
Martínez-Sánchez, Francisco; Muela-Martínez, José Antonio; Cortés-Soto, Pedro; García Meilán, Juan José; Vera Ferrándiz, Juan Antonio; Egea Caparrós, Amaro & Pujante Valverde, Isabel María (2015) Can the Acoustic Analysis of Expressive Prosody Discriminate Schizophrenia? The Spanish Journal of Psychology 18, e86, 1-9.
Shamei, A., & Bird, S. (2019). An Acoustic Analysis Of Cannabis-Intoxicated Speech. Canadian Acoustics 47(3), 108-109. (Available from https://jcaa.caa-aca.ca/index.php/jcaa/article/view/3343)
Frequently asked questions about Prosogram
Which languages may this stylization be applied to?
Prosogram may be applied to any language, because it analyses the acoustic signal produced by the speech organs, which are basically the same for all people, irrespective of their language.
Prosogram has been applied to French, Dutch, Italian, English, Greek, Spanish, German, Swedish, Polish, Kirundi (an African language), as well as to tone languages such as Mandarin Chinese. It has even been used in the study of sounds produced by animals (bears, whales), of musical instruments, and of the singing voice.
Of course, languages use different sets of sounds: some languages don't use the sounds [x] or [ç], some languages don't have nasal vowels, and so on. Some sounds may occur more frequently in one language than in another (for instance, Asian tone languages have many glides). And some languages have quite complex consonant clusters, while others mostly use CV syllables. This affects syllabification to some extent, but the impact on pitch stylization will still be limited, because of the relatively low intensity of many consonants.
When I run the script, a drawing appears in the Praat Picture window, then disappears immediately
The Praat Picture window is used only as a buffer for drawing the image, which is immediately saved in a graphics file. Then the Picture window (buffer) is cleared to prepare for the next drawing. To view the prosograms saved in the graphics file(s), read the following section of the User's guide: Viewing and printing prosograms.
How do I view the prosograms?
Read the following section of the User's guide: Viewing and printing prosograms.
How to include prosograms in Word documents or PowerPoint presentations?
Obtain the prosogram in PNG graphics format and insert this PNG picture file in the Word document. For more details, read the following section of the User's guide: Viewing and printing prosograms.
How do I print prosograms?
Read the following section of the User's guide: Viewing and printing prosograms.
How to include prosograms in ELAN?
Read the following section of the User's guide: Viewing prosograms in ELAN.
I want to use the stylization in another program
Read the following section of the User's guide: Exporting the stylization to other programs.
How do I make a resynthesis based on the stylization?
Read the following section of the User's guide: Resynthesized speech based on the stylized pitch.
I need help to make Prosogram work as expected
If you don't find a solution in the User's guide, feel free to contact the author. Describe your problem, and the selected settings (analysis settings, plotting options), and include the audio file and the annotation TextGrid you are trying to analyse.
Page initially created (on bach.arts.kuleuven.be): 2002-06-20. Last updated (on sites.google.com): 2022-12-28.