Milestones and Relevant Documents

Attached is a list of achievements and relevant documents. Here's our original proposal:

MSRFinalProposal.pdf

Human Science

The results from the first behavioral experiment are in and....

On the left, we show participants' Levenshtein distance (a simple edit distance metric) between the true sentence and their response (y-axis) by condition (x-axis). We found a large effect between the video-only condition and the other two, which is to be expected because lip reading alone is impossible. People in that condition sent us some nasty comments from their MTurk e-mails. But, more importantly, notice how similar the audio+video and audio-only conditions look (red and green). On the right, we plot average Levenshtein distance over each of the 10 trials. There are learning effects in both the audio+video and audio-only conditions, but again they look almost completely identical (and there is no statistically significant difference between them).
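
For concreteness, here is a minimal sketch of how a response can be scored against the true sentence, assuming a word-level Levenshtein distance and a pandas DataFrame with hypothetical columns condition, true_sentence, and response; this illustrates the metric, not our actual analysis code.

```python
import pandas as pd

def levenshtein(a, b):
    """Word-level edit distance: insertions, deletions, and substitutions each cost 1."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical trial-level data: one row per participant response.
trials = pd.DataFrame({
    "condition":     ["audio+video", "audio-only", "video-only"],
    "true_sentence": ["the causeway ended abruptly at the shore"] * 3,
    "response":      ["the causeway ended abruptly at the shore",
                      "the cause way ended at the shore",
                      "she sells sea shells"],
})
trials["distance"] = [levenshtein(t, r)
                      for t, r in zip(trials.true_sentence, trials.response)]
print(trials.groupby("condition")["distance"].mean())
```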

We then tested whether there were differences between audio-only and audio+video when people heard the same speaker repeatedly. That is, we wanted to know whether people could learn to adapt based on speaker identity. So we ran the same experiment as above, but gave each participant the same speaker saying 10 different sentences, instead of 10 different speakers each saying a different sentence.

The results, shown in the figures below, support the hypothesis that people adapt to speakers better with video than without. The figure on the left shows Levenshtein distance over time: distance decreases faster in the audio+video condition than in the audio-only condition, indicating that people who can both see and hear a speaker adapt to them better than people who can only hear. The figure on the right shows how adaptation relates to the difficulty of the sentence (binned into easy, average, and hard sentences). Interestingly, the benefit of audio+video over audio-only is only strong for hard sentences.

Interestingly, as the figure to the right shows, participants in the audio+video condition also took less time to respond, on average, than those in the audio-only condition. So not only were they more accurate, they were also faster to respond.
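
As a sketch of how that comparison can be run (not necessarily the exact test we used; the numbers below are placeholders for illustration, not our data):

```python
import numpy as np
from scipy import stats

# Placeholder per-participant mean response times in seconds (illustration only).
rt_audio_video = np.array([4.1, 3.8, 4.4, 3.9, 4.0])
rt_audio_only  = np.array([4.9, 5.2, 4.7, 5.0, 4.8])

# Welch's t-test: is the audio+video group reliably faster on average?
t, p = stats.ttest_ind(rt_audio_video, rt_audio_only, equal_var=False)
print(f"t = {t:.2f}, p = {p:.4f}")
```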

Model

Our speech recognition model is a modified implementation of WaveNet, and its architecture is shown on the right. Mel-frequency cepstral coefficient (MFCC) features are extracted from the speech signal and passed through a dilated convolutional neural network, which outputs the predicted sentence.
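
As a rough sketch of that pipeline (assuming librosa for MFCC extraction and PyTorch for the network; the file name, layer sizes, and output alphabet below are placeholders, and the real WaveNet-style model also uses causal, gated convolutions that we omit here):

```python
import librosa
import torch
import torch.nn as nn

class DilatedSpeechNet(nn.Module):
    """Simplified WaveNet-style stack: 1-D convolutions with exponentially
    increasing dilation, mapping MFCC frames to per-frame character logits."""
    def __init__(self, n_mfcc=13, n_chars=29, channels=64, n_layers=6):
        super().__init__()
        layers, in_ch = [], n_mfcc
        for i in range(n_layers):
            dilation = 2 ** i                      # 1, 2, 4, 8, ...
            layers += [nn.Conv1d(in_ch, channels, kernel_size=3,
                                 dilation=dilation, padding=dilation),
                       nn.ReLU()]
            in_ch = channels
        self.conv = nn.Sequential(*layers)
        self.out = nn.Conv1d(channels, n_chars, kernel_size=1)

    def forward(self, mfcc):                       # (batch, n_mfcc, time)
        return self.out(self.conv(mfcc))           # (batch, n_chars, time)

# Extract MFCC features from a waveform and run them through the network.
wav, sr = librosa.load("faks0_sa2.wav", sr=16000)  # hypothetical file name
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13)            # (13, time)
logits = DilatedSpeechNet()(torch.tensor(mfcc, dtype=torch.float32).unsqueeze(0))
```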

The other two models, one of which predicts demographic information and one of which extracts features for lip-reading, are both combined into this framework to maximize the amount of potentially speech-relevant information the network has access to.
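
A schematic of how that kind of fusion might look (the names and dimensions below are illustrative, not our actual implementation): each stream is reduced to a per-frame feature vector, the streams are concatenated, and a shared head produces the character predictions.

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate per-frame audio features, lip features, and (time-invariant)
    predicted demographic features, then predict character logits."""
    def __init__(self, audio_dim=64, lip_dim=32, demo_dim=8, n_chars=29):
        super().__init__()
        self.head = nn.Conv1d(audio_dim + lip_dim + demo_dim, n_chars, kernel_size=1)

    def forward(self, audio_feats, lip_feats, demo_feats):
        # audio_feats: (batch, audio_dim, time), lip_feats: (batch, lip_dim, time)
        # demo_feats:  (batch, demo_dim), broadcast across time before concatenation
        demo = demo_feats.unsqueeze(-1).expand(-1, -1, audio_feats.size(-1))
        combined = torch.cat([audio_feats, lip_feats, demo], dim=1)
        return self.head(combined)                 # (batch, n_chars, time)

# Random tensors standing in for the three feature streams.
audio, lips, demo = torch.randn(1, 64, 100), torch.randn(1, 32, 100), torch.randn(1, 8)
logits = FusionHead()(audio, lips, demo)           # (1, 29, 100)
```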

The best model turns out (unsurprisingly) to be the one that combines audio, lip-reading, and predicted demographic information. Here are a couple of example videos used to test the model, with the model's speech predictions given below each video. The purple outline around the lips shows the output of the model's lip-feature-finding component.

faks0_sa2_annotated.mp4
mbdg0_sx203_annotated.mp4

Transcript: dont ask me to carry an oily rag like that

Prediction: dhan ask men cary oily rag like that

Transcript: the causeway ended abruptly at the shore

Prediction: the cause by ented a roply eer shroe

Model versus humans

We compared the sentences the model did well on and the sentences it did poorly on with the human data. The figure on the right is a scatter plot of model performance versus human performance for each sentence. Interestingly, the model and humans performed well on similar sentences and poorly on similar sentences.
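
As a sketch of that comparison (the per-sentence mean Levenshtein distances below are placeholders; a rank correlation is one reasonable way to quantify the agreement, though not necessarily the analysis we ran):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Placeholder per-sentence mean Levenshtein distances (illustration only).
human_dist = np.array([1.2, 0.4, 2.8, 0.9, 3.5, 1.7])
model_dist = np.array([1.5, 0.6, 2.4, 1.1, 3.9, 1.4])

# Do the model and humans find the same sentences easy or hard?
rho, p = stats.spearmanr(human_dist, model_dist)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")

plt.scatter(human_dist, model_dist)
plt.xlabel("Human Levenshtein distance (per sentence)")
plt.ylabel("Model Levenshtein distance (per sentence)")
plt.show()
```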