What Are We Doing?

Our goal is to improve speech recognition by using video as well as audio. Most speech recognition software relies solely on auditory processing, yet humans draw on additional cues to aid their understanding: context, lip-reading, and demographic information about the speaker. We aim to build a neural network model that takes advantage of these additional sources of information.
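To make the idea concrete, here is a minimal late-fusion sketch in PyTorch. The document does not specify an architecture, so all module names, feature dimensions, and the word-level classification output are hypothetical simplifications: audio frames, lip-region video frames, and a categorical demographic code are each encoded separately, then concatenated before a shared classifier.

```python
import torch
import torch.nn as nn


class AudioVisualDemographicModel(nn.Module):
    """Hypothetical late-fusion model: audio + lip video + demographic code."""

    def __init__(self, n_audio_features=40, n_video_features=512,
                 n_demographic_codes=8, n_words=1000, hidden=256):
        super().__init__()
        # Audio branch: e.g. log-mel filterbank frames -> GRU summary vector.
        self.audio_rnn = nn.GRU(n_audio_features, hidden, batch_first=True)
        # Video branch: e.g. per-frame lip-region embeddings -> GRU summary vector.
        self.video_rnn = nn.GRU(n_video_features, hidden, batch_first=True)
        # Demographic branch: learned embedding of a categorical speaker code.
        self.demo_embed = nn.Embedding(n_demographic_codes, 32)
        # Fusion + word classifier over the concatenated summaries.
        self.classifier = nn.Sequential(
            nn.Linear(hidden + hidden + 32, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_words),
        )

    def forward(self, audio, video, demo):
        # audio: (batch, audio_frames, n_audio_features)
        # video: (batch, video_frames, n_video_features)
        # demo:  (batch,) integer demographic codes
        _, a = self.audio_rnn(audio)   # a: (1, batch, hidden)
        _, v = self.video_rnn(video)   # v: (1, batch, hidden)
        d = self.demo_embed(demo)      # d: (batch, 32)
        fused = torch.cat([a.squeeze(0), v.squeeze(0), d], dim=-1)
        return self.classifier(fused)  # word logits: (batch, n_words)


if __name__ == "__main__":
    model = AudioVisualDemographicModel()
    audio = torch.randn(2, 100, 40)   # 100 audio frames per utterance
    video = torch.randn(2, 25, 512)   # 25 lip-region frames per utterance
    demo = torch.tensor([0, 3])       # categorical demographic codes
    print(model(audio, video, demo).shape)  # torch.Size([2, 1000])
```

Late fusion is only the simplest starting point; a real system might fuse the streams earlier, use attention, or predict word sequences rather than isolated words.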

We will compare the model's performance to human performance and identify which kinds of video features are most useful for improving speech recognition in both the model and humans. We are not aware of any previous academic work that specifically investigates the usefulness of demographic information paired with audio-visual speech recognition. Furthermore, while some work in psycholinguistics has examined the influence of demographic information such as race on speech perception, to our knowledge no studies have directly measured its effect on word- or sentence-recognition accuracy. Comparing which kinds of demographic information humans and our model use to optimize speech recognition is also theoretically important for cognitive models of behavior.