Samples for matrices based on perceptual errors in English and Spanish are below.
The data are from the perceptual confusion studies in Cutler et al. (2004). At the time of writing (Fall 2013), the full data for the studies were available from Cutler et al. at this address.
Cutler et al.'s study provides confusion data at signal-to-noise ratios (SNR) of 0, 8 and 16. The masking noise in all cases was multi-speaker babble. Our similarity matrix is based on the confusion data at SNR 16, since the similarity scores obtained at that level gave better results with our test set.
Cutler et al.'s data offer confusion scores grouped by the position of the segment in the syllable (i.e. when the segments are in initial or final position), as well as averaged across positions. This is useful for assessing how likely consonants are to be confused as onsets vs. as codas, or how likely a vowel is to be confused in an open vs. a closed syllable.
For audio alignment, we obtained better results when basing our similarity matrix on the confusion results averaged across position (i.e. irrespective of syllable-initial or syllable-final status).
The scores in our similarity matrix represent perceptual phone confusion percentages, normalized into a 1-1000 range. A score of -500, i.e. 1/2 × (0 – max({ScoreRange})), was stipulated for cases where no confusion had taken place for a phoneme pair.
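The scoring scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the percentages were mapped onto the 1-1000 range, so a linear min-max rescaling is assumed here, and the function name normalize_confusions and the dictionary layout are hypothetical.

```python
def normalize_confusions(confusions, lo=1, hi=1000):
    """Map confusion percentages onto the [lo, hi] range.

    confusions: dict mapping (phone_a, phone_b) -> confusion percentage.
    ASSUMPTION: linear min-max scaling over the non-zero percentages;
    the source only states that scores were normalized to 1-1000.
    """
    # Penalty for pairs that were never confused: 1/2 * (0 - max(range)).
    no_confusion_score = 0.5 * (0 - hi)  # -500 when hi == 1000

    nonzero = [v for v in confusions.values() if v > 0]
    if not nonzero:
        return {pair: no_confusion_score for pair in confusions}

    vmin, vmax = min(nonzero), max(nonzero)
    span = vmax - vmin

    scores = {}
    for pair, value in confusions.items():
        if value <= 0:
            scores[pair] = no_confusion_score
        elif span == 0:
            # All observed confusions are equal: give them the top score.
            scores[pair] = float(hi)
        else:
            scores[pair] = lo + (value - vmin) * (hi - lo) / span
    return scores


# Toy percentages, invented for illustration only (not from Cutler et al.).
scores = normalize_confusions({("p", "b"): 10.0, ("p", "t"): 5.0, ("p", "m"): 0.0})
```

With these toy values, the most-confused pair receives 1000, the least-confused pair receives 1, and the pair that was never confused receives -500.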
English Sample
The Spanish matrix was based on data provided by García Lecumberri et al., an extension of the corpus of human misperceptions in noise they developed and discuss in García Lecumberri et al. 2013.
They presented 69 native speakers of Spanish with over 20,000 single-word stimuli, under different masking-noise conditions, and asked the speakers to write the word they had heard. The final misperception corpus, which totals 3,294 stimuli with their associated responses, contains only stimuli for which certain agreement thresholds were reached among participants’ responses.
The study is a free-response error-elicitation task, not a closed-response task like the Cutler et al. study we used for our English perceptual matrix (see above). However, we used this corpus since, unlike other Spanish perception studies, it provides data for all phonemes in our decoder’s phoneset.
For coherence with our English data, we based our matrices on the 1,838 stimuli where multi-speaker babble was used as the masker. SNR in these stimuli ranged between -8 and +1.
Our confusion matrix was based on 6,807 stimulus-response pairings from the study. To compute it, we compared the corpus’ stimuli and responses in cases where the response involved a single-phoneme error.
We recorded the percentage of matches and mismatches between each stimulus and each response in the stimulus' response set (a maximum of 15 responses were available per stimulus). Match and mismatch percentages were normalized to a 1-1000 range. For phoneme pairs where no confusion had taken place, a score of -500, i.e. 1/2 × (0 – max({ScoreRange})), was entered in the matrix.
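The tallying step described above might look roughly like the following Python sketch. It is not the procedure actually used: it assumes that stimuli and responses are already available as phoneme lists, and that a "single-phoneme error" means exactly one substitution between sequences of equal length (insertions and deletions are ignored). The names single_substitution and confusion_percentages are hypothetical.

```python
from collections import Counter, defaultdict


def single_substitution(stimulus, response):
    """Return the index of the only differing phoneme, or None.

    ASSUMPTION: a single-phoneme error is one substitution between two
    phoneme sequences of equal length.
    """
    if len(stimulus) != len(response):
        return None
    diffs = [i for i, (s, r) in enumerate(zip(stimulus, response)) if s != r]
    return diffs[0] if len(diffs) == 1 else None


def confusion_percentages(pairs):
    """Tally matches and mismatches per stimulus phoneme, as percentages.

    pairs: iterable of (stimulus_phonemes, response_phonemes) lists.
    Returns a dict mapping (stimulus_phone, response_phone) -> percentage
    of that stimulus phone's occurrences heard as response_phone.
    """
    counts = defaultdict(Counter)
    for stimulus, response in pairs:
        # Keep only responses that contain exactly one substitution.
        if single_substitution(stimulus, response) is None:
            continue
        for s, r in zip(stimulus, response):
            counts[s][r] += 1

    percentages = {}
    for s, row in counts.items():
        total = sum(row.values())
        for r, n in row.items():
            percentages[(s, r)] = 100.0 * n / total
    return percentages


# Invented toy pairings, for illustration only (not corpus data).
toy_pairs = [
    ("k a s a".split(), "g a s a".split()),
    ("k a s a".split(), "k a s o".split()),
    ("p e r o".split(), "p e l o o".split()),  # skipped: lengths differ
]
toy_percentages = confusion_percentages(toy_pairs)
```

The resulting percentages would then be rescaled to the 1-1000 range, with -500 entered for any phoneme pair that never appears in the tally.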
Spanish Sample