A deeper dive into analyzing stylistic profiles of speech from Rooftop Rhythms using the show's raw transcriptions.
by Sarah Al-Towaity and Arshiya Khattak
We’ve ventured far and wide into the world of computer-assisted reading. But so far, we’ve been delving into the written word. Before one writes, or types, there is always a moment of reflection behind every word. The moment may be fleeting, just a quarter of a second, but it is always there. Speech is far less curated than writing, presenting an exciting avenue for research.
And what better way to study this than Rooftop Rhythms (hereafter: RR), a hub for spoken word?
RR is an open-mic night where poets and musicians from all over the UAE come together to showcase their talents. All the techniques we’ve explored so far have aimed to discover patterns and themes in texts generally written by individual authors. Because the participants in RR are so varied, it would be difficult to come to any meaningful conclusions about them as a group. However, a realization dawned on us: we could study the MC, Dorian Rogers, the common denominator across episodes and the face of RR.
What we needed was a magic tool, one that would glance over a block of text and declare “’Tis Dorian!”. Luckily, R has a package called “stylo” that does more or less the same thing. Stylo implements Rolling Stylometry, a technique in which a text is segmented into blocks that are classified sequentially, by stylistic similarity, against a set of reference texts. Given that we wanted to “roll” over an RR episode detecting Dorian-speak, this technique was a natural fit for analyzing Dorian’s speech characteristics. Using the package’s rolling.classify() function, we could train one of the machine learning models the package supports on a training corpus. In the same way humans learn to tell things apart by seeing multiple examples, we fed our model two large text files: one containing exclusively Dorian’s segments, the other everything else. To build these, we selected four RR episodes and parsed them, inserting a marker wherever Dorian started speaking. Now we only had to put our model to the test.
We tested our model by using each of the four episodes as a test case. The Support Vector Machine model (the model we picked) would “roll” over an episode, displaying the blocks it predicted to belong to Dorian with high probability in red and the remaining blocks in green. Occasionally, the visualizations produce a shadowed block of the opposite color on top of the main block; its size indicates the probability that the segment in question belongs to the alternate category.
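In stylo, this whole pipeline is driven by rolling.classify(), pointed at a directory of training texts and a directory of test texts with classification.method = "svm". As a self-contained illustration of the mechanism (not our actual setup), here is a toy base-R sketch that substitutes a simple nearest-centroid rule for the SVM and invented mini-texts for our corpus:

```r
# Toy rolling classification: slide a window over a "transcript" and assign
# each slice to whichever training profile its word frequencies sit closest to.
# (stylo's rolling.classify() does this at scale with real classifiers.)
freqs <- function(txt, vocab) {
  tokens <- strsplit(tolower(txt), "\\s+")[[1]]
  as.numeric(table(factor(tokens, levels = vocab))) / max(length(tokens), 1)
}

dorian_train <- "thank you everybody show love for the next poet thank you"
other_train  <- "the sea remembers my mother and the sky forgets my name"
episode      <- paste("thank you everybody welcome to the show",
                      "the sea the sky my mother my name",
                      "show love for the next poet everybody")

vocab <- unique(strsplit(tolower(paste(dorian_train, other_train)), "\\s+")[[1]])
centroids <- rbind(dorian = freqs(dorian_train, vocab),
                   other  = freqs(other_train,  vocab))

tokens <- strsplit(tolower(episode), "\\s+")[[1]]
slice_size <- 7   # tokens per rolling block (stylo's slice.size, in miniature)
labels <- sapply(seq_len(length(tokens) - slice_size + 1), function(i) {
  slice <- paste(tokens[i:(i + slice_size - 1)], collapse = " ")
  dists <- apply(centroids, 1, function(cent) sum((freqs(slice, vocab) - cent)^2))
  names(which.min(dists))
})
print(labels)   # "dorian" slices at the edges, "other" slices in the middle
```

The real function also takes slice.size and slice.overlap arguments controlling how finely the window rolls over the text.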
Episode: 22 November 2019
Episode: 17 April 2020
Episode: 23 Oct 2020
Episode: 30 April 2021
The produced visualizations show that the model did a good job of identifying Dorian. Guided by the dotted vertical line labelled “dorian”, it is evident that in the majority of the episodes the line falls within red blocks more often than green ones. Of course, to achieve the most accurate results, we had to systematically increase the number of tokens in the training corpus files.
vs.
After expanding the training files, we notice that the blocks have become narrower, encapsulating more of Dorian. Additionally, the model has become more certain: the shadowed blocks are smaller and less frequent.
Whether we fed the model a little data or a lot, one thing remained consistent: the structure of the episode. Our results, with their general trend of alternating block colors, are consistent with RR’s format and Dorian’s role as MC, coming in primarily at the beginning and the end and intermittently between poets. In conversation with Dorian, he ingeniously pointed out that he could use these very visualizations to assess his own performance as an MC!
Of course, the model’s algorithm was a black box to us: we knew where Dorian might be speaking in an episode, but not which stylistic elements helped the model construct the visualization. However, while parsing the episodes we had performed a form of speed reading of our own, trying to “read like a computer”. We theorized that, as an MC, Dorian consistently repeated certain words and phrases, such as “Thank”, “show love”, and “everybody”, forming “fringe” headers and footers around each Dorian block. Seeing as the model relied on most frequent words, this was quite plausible.
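A toy example (invented lines, not our transcripts) shows why those repeated MC formulas matter to a classifier built on most frequent words:

```r
# Count word frequencies in two invented snippets: an MC-style "fringe" and a
# typical poem line. Dorian's formula words dominate the top of the first list.
word_freqs <- function(txt) {
  tokens <- strsplit(tolower(txt), "[^a-z']+")[[1]]
  sort(table(tokens[tokens != ""]), decreasing = TRUE)
}

mc_fringe <- "Thank you, everybody, show love! Show love, everybody, thank you."
poem_line <- "The sea remembers what the sky forgets."

word_freqs(mc_fringe)   # every one of the formula words appears twice
word_freqs(poem_line)
```

A handful of words with reliably high frequency is exactly the kind of signal a most-frequent-word profile picks up on.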
Our results with R were fascinating — but is it really that easy to figure out what it means to be Dorian? He’s a multifaceted being and MC-ing is only one face of his identity. In fact, a far more important part of his selfhood is his poetry. In our conversation with him, he said that he didn’t experience any different sense of self while reciting poetry; his work is a reflection of the way he speaks — conversational, yet profound.
We needed to create a compilation of Dorian’s poetry. Dorian had courteously gifted us a physical copy of his poetry book, The Million Mile Stare, a few months prior, so all we had to do was take photos of the pages and convert them into .txt format. We used an accessible form of OCR: Google Docs. The conversions were surprisingly accurate, with some exceptions.
Our compilation of Dorian’s poetry amounted to about four thousand words. The top words from Dorian’s MC-ing weren’t surprising, pertaining specifically to NYU Abu Dhabi and the UAE. However, after removing stop words, we noticed that a massive part of MC Dorian’s speech is just filler words. Glancing at the top word list of Dorian’s poetry, we can already see that the content is much more meaningful, containing words such as “god”, “death”, and “wonder”. The most interesting finding came from using the Keyword List tool with Dorian’s regular speech as the reference file: it came back empty. Even though we are both AntConc veterans, neither of us had seen this before. Does this imply that there aren’t any words that set poet Dorian apart from regular Dorian?
Dorian's MC speech word list
Dorian's poetry word list
Dorian's MC speech word list without the stop words list
Dorian's poetry with MC Dorian speech as reference file
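AntConc’s Keyword List ranks words by a keyness statistic (log-likelihood by default) measuring how much more often a word occurs in the target file than in the reference file, keeping only words that clear a significance threshold. A minimal sketch of that statistic with made-up counts (the thresholding is what makes an entirely empty list possible):

```r
# Log-likelihood keyness for one word, given its count in the target corpus
# (a), its count in the reference corpus (b), and the two corpus sizes.
keyness_ll <- function(a, b, total_a, total_b) {
  e1 <- total_a * (a + b) / (total_a + total_b)   # expected count in target
  e2 <- total_b * (a + b) / (total_a + total_b)   # expected count in reference
  2 * (ifelse(a > 0, a * log(a / e1), 0) + ifelse(b > 0, b * log(b / e2), 0))
}

# A word much denser in a 4,000-word poetry file than in the MC speech
# scores high...
keyness_ll(a = 12, b = 1, total_a = 4000, total_b = 20000)
# ...while a word used at the same relative rate in both files scores 0;
# if no word clears the threshold, the keyword list comes back empty.
keyness_ll(a = 4, b = 20, total_a = 4000, total_b = 20000)
```

The corpus sizes and counts here are illustrative, not measurements from our files.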
When we ran the model with the file containing regular Dorian speech as the training data and Dorian’s poetry as the test text, the majority of the book was predicted to be non-Dorian, except for a few peculiar segments, each about the length of a typical poem.
We then inspected the text in the regions flagged by the visualization to attempt a more nuanced finding. These poems center on police brutality, racism, and inequality, the same themes he identified as recurrent motifs in his poetry. Perhaps Dorian interacted more with poets who recited poems around similar issues, and that was picked up by the model. Overall, however, the model did not do so well: Dorian’s emphasis on showing rather than telling evidently did not work with the model’s more localized approach.
Close reading of Dorian's OCR'd poetry book
We wanted to study Dorian’s poetry further: we contrasted his work with the poetry from the RR performers. We asked Dorian how his poetry compared to the poetry scene within the UAE. “I definitely think there are similarities between me and the poets at the open-mic,” he answered. “Both of us focus on childhood and social activism.” We see in The Million Mile Stare that Dorian’s poetry does focus on social activism, especially on the racial climate within the USA. He did note some differences, stating that his poetry veered more towards being page poetry than spoken word.
To visualize this, we ran a text file of Dorian as an MC, Dorian as a poet, and the other performers’ poetry through RStudio, plotting a graph comparing all three. On the graph of Dorian as a poet versus as an MC, we see that Dorian uses the words “deep”, “american”, and, oddly, “ahmed” at almost the same frequency (possibly because Dorian has poems about the racially charged murder of Ahmaud Arbery, and there are probably many performers named Ahmed at RR). In Dorian’s poetry versus other people’s poetry, “god” and “deep” appear commonly in both, while words such as “demons”, “love”, “black”, and “children” sit more on Dorian’s side, suggesting that Dorian’s poetry delves into personal and specific themes, while the poems across RR are more varied.
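This kind of comparison graph can be sketched as follows (toy snippets in place of the real corpora): each word is placed by its relative frequency in the two corpora, so words near the diagonal are shared between them and words hugging an axis are distinctive of one.

```r
# Place each word by its relative frequency in two corpora and label the points.
rel_freq <- function(txt) {
  tokens <- strsplit(tolower(txt), "[^a-z']+")[[1]]
  tokens <- tokens[tokens != ""]
  table(tokens) / length(tokens)
}

dorian_poetry <- "god deep demons love black children deep god love"
other_poetry  <- "god deep sea sky home mother city night rain"

fa <- rel_freq(dorian_poetry)
fb <- rel_freq(other_poetry)
words <- union(names(fa), names(fb))
x <- as.numeric(fa[words]); x[is.na(x)] <- 0   # freq in Dorian's poetry
y <- as.numeric(fb[words]); y[is.na(y)] <- 0   # freq in the other poets' work

plot(x, y, type = "n", xlab = "Dorian as poet", ylab = "Other RR poets")
text(x, y, labels = words)   # shared words like "god" land near the diagonal
```

The word lists here are invented stand-ins chosen to echo the pattern described above, not counts from the actual files.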
All of our previous analysis is contingent on having faith in the speech-to-text algorithm of NYU Stream, the service used to transcribe RR episodes. Whether because of audio quality or some subtler infiltration of bias into the algorithm, we knew the transcriptions treated certain speakers unevenly. The transcriptions’ rawness presented the possibility that we were drawing conclusions from words that were never uttered, merely mistranscribed. We wanted to test whether cleaning portions of the transcriptions would alter the results of the stylometry.
Our approach was simple: correct a Dorian and a non-Dorian portion of the transcriptions, test them against the training sample of raw transcriptions, and see whether the model would still deduce the correct category. To our delight, the model correctly predicted the category of each sample. Although it did worse with the Dorian sample than with the other, the predominance of red was still apparent. Overall, it seems the transcription errors were not potent enough to muddy the stylistic profile of Dorian’s speech.
Dorian's corrected sample is predicted to be predominantly Dorian.
Other speakers' corrected sample is predicted to be predominantly non-Dorian.
Our study of Dorian goes in so many directions that a myriad of conclusions could be drawn. The most notable might concern the limitations of the software in detecting variation within one person’s speech, like the model’s failure to identify Dorian’s poetry as his. Because a person produces speech in different modes (poetic, colloquial, formal), it is hard, even for us humans, to fully grasp the marked characteristics underlying their speech.
Date: 16th December 2021