The goal of this project was to develop a means of accent detection and correction to help with the pronunciation of challenging English words. Before starting, I hypothesized that the FFT spectra of repeated pronunciations of the same word would be similar and consistent. I tested this hypothesis, and then applied some signal processing to construct a feedback mechanism that would give the speaker various visuals to improve his/her pronunciation of a specific word.
The first step in this project was to verify that the FFT spectrum was indeed a valid means of measuring word pronunciation. To do this, I performed a consistency test by pronouncing the same word multiple times into a microphone and comparing the resulting spectra. Below you can see two of my own pronunciations of the word "squirrel".
From the above, one can see that the overall shape of the two pronunciations in the frequency domain was quite consistent. To improve readability, I implemented a simple moving average filter to smooth out the signals.
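As a rough illustration of the spectrum and smoothing steps described above, here is a minimal Python/NumPy sketch. It is not the project's original script: the function names `magnitude_spectrum` and `moving_average`, the window length of 25 bins, and the synthetic 440 Hz test tone standing in for a microphone recording are all assumptions made for the example.

```python
import numpy as np


def magnitude_spectrum(signal, sample_rate):
    """Return (frequencies in Hz, magnitudes) of the one-sided FFT of a real signal."""
    signal = np.asarray(signal, dtype=float)
    magnitude = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    return freqs, magnitude


def moving_average(values, window=25):
    """Smooth `values` with a simple moving average of `window` samples (assumed length)."""
    kernel = np.ones(window) / window
    # mode="same" keeps the output the same length as the input spectrum
    return np.convolve(values, kernel, mode="same")


# Synthetic stand-in for a recording: one second of a noisy 440 Hz tone (hypothetical data)
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
rng = np.random.default_rng(0)
recording = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(t.size)

freqs, spectrum = magnitude_spectrum(recording, sample_rate)
smoothed = moving_average(spectrum, window=25)
```

A longer window gives a smoother curve but also flattens and widens the peaks, so the window length is a trade-off between readability and detail.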
Once the proof-of-concept was in place, I implemented a few additional processing steps to address some issues before testing the scripts on other people. The first issue was speaking volume. To ensure that the peaks of different trials would line up, I normalized each spectrum by dividing the data by its maximum value. The second issue was the possibility of having two pronunciations that were similar but differed in pitch (e.g. a girl's voice vs. a boy's voice). To account for that, I computed the cross-correlation of the two signals that I wanted to compare, and then shifted one signal by the lag at which the cross-correlation was highest. Finally, I applied a minimum threshold to preserve only the dominant peaks and remove the low-amplitude components. These processes can be seen below:
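The three steps described here (normalization, alignment by cross-correlation, and thresholding) could be sketched in Python/NumPy roughly as follows. This is an illustrative sketch under stated assumptions, not the original scripts: the function names `normalize`, `align`, and `threshold`, the threshold value of 0.2, and the two synthetic Gaussian-peak "spectra" are hypothetical, and both spectra are assumed to have the same length.

```python
import numpy as np


def normalize(spectrum):
    """Divide by the maximum value so the largest peak has height 1."""
    spectrum = np.asarray(spectrum, dtype=float)
    peak = spectrum.max()
    return spectrum / peak if peak > 0 else spectrum


def align(reference, other):
    """Shift `other` by the lag that maximizes its cross-correlation with `reference`.

    Assumes both arrays have the same length. Returns (shifted copy, lag in bins);
    a negative lag means `other` was moved toward lower bin indices.
    """
    xcorr = np.correlate(reference, other, mode="full")
    lag = int(np.argmax(xcorr)) - (len(other) - 1)
    shifted = np.zeros_like(other)
    if lag > 0:
        shifted[lag:] = other[:-lag]
    elif lag < 0:
        shifted[:lag] = other[-lag:]
    else:
        shifted[:] = other
    return shifted, lag


def threshold(spectrum, minimum=0.2):
    """Zero out everything below `minimum` (an assumed value), keeping dominant peaks."""
    return np.where(spectrum >= minimum, spectrum, 0.0)


# Hypothetical spectra: same shape, but `other` is louder and shifted up by 15 bins
bins = np.arange(200)
reference = 3.0 * np.exp(-0.5 * ((bins - 60) / 4.0) ** 2) \
    + 0.3 * np.exp(-0.5 * ((bins - 120) / 4.0) ** 2)
other = 5.0 * np.exp(-0.5 * ((bins - 75) / 4.0) ** 2) \
    + 0.5 * np.exp(-0.5 * ((bins - 135) / 4.0) ** 2)

ref_n = normalize(reference)
aligned, lag = align(ref_n, normalize(other))
ref_clean = threshold(ref_n)
other_clean = threshold(aligned)
```

In this toy case the estimated lag undoes the 15-bin offset, and the threshold removes the smaller secondary peak from both spectra while leaving the dominant one intact.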