Before I dive into an explanation of this exercise, a recap is in order to contextualize the forthcoming analysis. Previously, we looked at the outputs of OCRed PDFs and got a glimpse of what the process of extracting text from the “wilderness” of various input formats (in the case of OCR, PDFs without text layers) can look like.
We now move to yet another “natural resource” from which text can be extracted: speech. This exercise can be viewed as a preamble to a larger project that aims to analyze the text output generated by NYU Stream’s speech-to-text algorithms for Rooftop Rhythms (RR), a spoken word show based in the UAE. With speech, gauging the transcription’s accuracy becomes highly relevant to any analysis of the transcribed text. It is therefore vital to ask: are the speech-to-text algorithms employed by platforms that offer transcription services inherently biased against a certain subset of accents, specifically those of speakers whose mother tongue is not English?
With the aim of answering this question, I chose segments of RR’s 9th anniversary episode from the 30th of April 2021 to see if I could spot a form of accent bias. I corrected an aggregate of 15 minutes’ worth of transcription, split up as follows: 3 minutes of Bill Bragin, the executive creative director of NYU Abu Dhabi’s Arts Center, whose standard American accent is close to machine-generated American English speech; 3 minutes of Dorian Rogers, RR’s founder, who has quite a pronounced Southern American accent; and 9 minutes of two poets, one with an African American accent and the other with a South Asian accent. For Dorian, a poet himself, and for the two poets, I chose to include portions of both colloquial and poetic speech to investigate whether poetic cadence interferes with transcription quality.
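Although the comparison that follows is qualitative, one way to make “transcription quality” comparable across these segments would be a word error rate (WER): the word-level edit distance between the machine transcript and the corrected transcript, divided by the length of the corrected one. The sketch below is only a minimal illustration of that idea, not the tooling used in this exercise; the two example strings are hypothetical stand-ins for a real transcript pair.

```python
# Minimal word error rate (WER) sketch: word-level edit distance between
# the machine transcript and the hand-corrected one, normalised by the
# length of the corrected (reference) transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # omission (deletion)
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical snippets standing in for a corrected and a machine transcript.
corrected = "welcome to the ninth anniversary of rooftop rhythms"
machine = "welcome to the anniversary of rooftop rhythm"
print(f"WER: {wer(corrected, machine):.0%}")  # 2 errors / 8 reference words = 25%
```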
A side-by-side comparison of Bill's original transcription (red) and the corrected version (green). The changes are shown as highlighted segments.
This sample was Bill’s and, unfortunately, it was not of optimal quality: it had a moderate level of echo and static, and the volume was low. I assumed that the general similarity of Bill’s accent, pace, and voice to automated speech would compensate for the audio quality. To my great surprise, this was one of the samples on which the algorithm performed worst. Not only did it fail to recognize Bill’s speech, often transcribing it as straight gibberish, but it also failed to transcribe a relatively large portion of his segment, highlighted in orange. I also noticed that this degree of omission had no counterpart in the other segments, some of which had audio quality similar to Bill’s. I got to thinking: has the algorithm become so fine-tuned to accents similar to Bill’s that it recognized the audio’s poor intelligibility and decided against transcription because it expected errors would be produced?
Segments highlighted in blue were inaudible, while those highlighted in orange were inserted by me because the algorithm failed to transcribe them.
Dorian's MC'ing original vs. corrected
Dorian's poetry original vs. corrected
From: https://yallaabudhabi.ae/home_post/rooftop-rhythms-abu-dhabi/
Dorian’s colloquial speech in the sample was slow-paced, with every vowel and consonant clearly enunciated. Despite Dorian’s marked accent, the algorithm performed very well; I was definitely elated to take a breather while correcting Dorian’s transcription. The poetry segment, however, was performed in person (from the 22nd of November 2019 episode), and the audio was not as clear, which explains the relative drop in transcription quality. In Dorian’s poetry, most of the unrecognized words are infrequently used in everyday speech and belong to a more literary lexicon ("summon", "bearing", "velvet", etc.).
original vs. corrected
This sample was echoey, but the poet’s strong voice compensated. The poet’s pace varied, speeding up in moments of passion. Overall, no omissions were found and no gibberish was produced, only words incorrectly transcribed as other English words.
So far, it seems that audio quality trumped accent. However, the speakers thus far are all native speakers of English; introducing the next speaker into the analysis skewed my conclusion a fair bit in favor of accent bias.
The audio was clear, but the transcription quality was not the greatest, with the majority of the failures occurring in the poetry recital. I ascribe this to two factors. The first is that the poet kept code-switching from English to Malayalam and, occasionally, to Arabic. The second is that the poet’s accent was quite pronounced and her pace was fast. Additionally, judging by the sound of her breathing, the poet seemed anxious, which led to poor enunciation of some words.
However, there was one observation, which had previously occurred with Dorian, that I could not look past. The algorithm transcribed the poet’s “poem” as “bomb”. Earlier in the episode, Dorian was transcribed as saying "terrorism" in one segment. Such transcription mistakes (two counts in this episode alone) could disproportionately affect people of color; when stylometry was applied to Dorian’s poetry, it was not recognized as Dorian’s except for a few segments, one of which included the word “terrorist”.
Does this stem from actual bias-induced stereotypes that have managed to seep into the algorithm's design? While this observation could simply be the product of chance, it could also be compelling evidence of bias against people of certain identities whom the algorithm recognizes through their voices. Since the algorithm used by NYU Stream is a black box to me at the moment, I can neither confirm nor deny either statement, but this observation alone is, at the very least, a demonstration that the general concern over accent bias is valid.
There were definitely some similarities in the errors: words on both extremes of the colloquialism spectrum were improperly transcribed, slang on one end (“swag”, “slammin'”, etc.) and infrequently used words on the other (“Djembe”, “renege”, etc.). However, one cannot help but notice the difference in the type of errors. When audio quality is comparable, errors in transcribing English spoken by people of color and non-native speakers seemed to be primarily a mapping of the spoken word onto another English word, whereas the errors in the standard American accent sample were omissions or gibberish. Although we can see hints of bias in this small investigation, it is precisely the mechanism behind this mapping that should be further explored to understand the algorithmic motivations behind the variance in error type across accents.
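A natural next step would be to make this error-type comparison quantitative: align each machine transcript against its corrected counterpart and tally substitutions, omissions, and insertions per speaker. The sketch below shows one way to do this with Python's standard difflib; the example strings are hypothetical stand-ins for a real transcript pair, and a fuller analysis would also need to normalise punctuation and handle the untranscribed (orange) portions explicitly.

```python
import difflib
from collections import Counter

def error_profile(reference: str, hypothesis: str) -> Counter:
    """Tally substitutions, omissions, and insertions between a corrected
    (reference) transcript and a machine (hypothesis) transcript."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    counts = Counter()
    matcher = difflib.SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            # Words mapped onto other words (e.g. "poem" -> "bomb").
            counts["substitution"] += max(i2 - i1, j2 - j1)
        elif tag == "delete":
            # Words in the corrected text missing from the machine output.
            counts["omission"] += i2 - i1
        elif tag == "insert":
            # Extra machine output with no counterpart, e.g. gibberish.
            counts["insertion"] += j2 - j1
    return counts

# Hypothetical example standing in for a real transcript pair.
print(error_profile("this poem is about home", "this bomb is about home"))
# Counter({'substitution': 1})
```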
Ready for grading!
Date: 16th December 2021