This piece, by Onno Berkan, was published on 10/01/25. The original paper, by Willett et al., was published in Nature on 08/23/23.
This Stanford study tackles one of the most pressing topics in brain-computer interfaces today: decoding speech for individuals with paralysis. The researchers worked with a participant with ALS who could both attempt to speak sentences aloud and imagine speaking them, while activity was recorded from the motor cortex. From this activity, they tried to predict both the mouth and face movements and the phonemes the participant was trying to produce. This let them break attempted words down into orofacial movements and individual sounds, and from these, predict what the participant was trying to say more accurately than previous speech BCIs.
Imagine a life where you couldn't move – couldn't run, walk, or even speak. What would you do – what could you do? You might lose hope, spiral into depression, see life as a cruel joke… But what if there were some light at the end of the tunnel?
The tunnel, in this case, leads you to the Neural Prosthetics Translational Lab (NPTL) at Stanford. Here, the late Krishna Shenoy worked on neuroprosthetics to restore movement; his legacy is being carried on by Jaimie Henderson and Francis Willett, the last and first authors, respectively, of the paper I'm writing about.
In this paper, the researchers wanted to decode the movement patterns that produce speech, both orofacial (mouth and face) movements and the phonemes (the sounds that combine to form words). If you can decode these, you can decode what someone's trying to say.
How does this decoding work? Every movement starts as a brain signal. Neural motor decoding records brain activity while a person executes (or attempts) a movement, and tries to associate patterns in that activity with the movement itself.
They implanted two microelectrode arrays into area 6v (ventral premotor cortex, previously shown to be involved in speech production). They recorded from the participant as they attempted individual orofacial movements, individual phonemes, or single words in response to cues shown on a screen. The participant could still attempt these movements and vocalize, but could not produce intelligible speech.
The researchers ran this data through a naive Bayes classifier. This simple machine learning method learns the probability that the participant is attempting a specific movement, given the observed neural activity. Using this, they achieved 92% accuracy in classifying among 33 orofacial movements and 62% accuracy in classifying among 39 different phonemes.
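To make that concrete, here's a toy sketch of the idea in Python. The data is simulated and the sizes are made up – this is not the study's actual pipeline – but it shows how a naive Bayes classifier can map binned neural activity to an attempted movement:

```python
# A minimal sketch on made-up data (not the study's recordings or code):
# each trial's binned spike counts form a feature vector, and the classifier
# learns P(attempted movement | neural activity).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_channels, n_classes = 128, 33                   # hypothetical sizes

# Simulate spike counts whose average rate depends on which movement was
# attempted, so there is a real pattern for the classifier to find.
labels = np.repeat(np.arange(n_classes), 10)      # 10 trials per movement
class_means = rng.uniform(2.0, 10.0, size=(n_classes, n_channels))
spike_counts = rng.poisson(class_means[labels]).astype(float)

clf = GaussianNB()
scores = cross_val_score(clf, spike_counts, labels, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```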
Next, the researchers attempted to decode entire sentences in real time. To achieve this, they trained a recurrent neural network (RNN) to emit, every 80 milliseconds, a probability for each phoneme based on the neural activity. The RNN would take processed neural data and, for each 80-ms window, try to predict the phoneme the participant was trying to produce. These phoneme probabilities were then fed into a language model, which attempted to predict the words the speaker was trying to say. Using an RNN allowed them to model the sequential structure of a sentence.
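For readers curious what such a decoder could look like, here's a rough sketch in PyTorch. The GRU architecture, layer sizes, feature count, and the extra "silence" class are my assumptions for illustration – the paper's exact network and training setup differ:

```python
# A minimal sketch (assumed architecture, not the authors' exact decoder) of an
# RNN that turns a stream of 80-ms neural feature bins into per-bin phoneme
# probabilities.
import torch
import torch.nn as nn

N_FEATURES = 256   # hypothetical number of neural features per 80-ms bin
N_PHONEMES = 40    # 39 phonemes + 1 "silence" class (assumption)

class PhonemeRNN(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.gru = nn.GRU(N_FEATURES, hidden, num_layers=2, batch_first=True)
        self.readout = nn.Linear(hidden, N_PHONEMES)

    def forward(self, x):              # x: (batch, time_bins, N_FEATURES)
        h, _ = self.gru(x)
        return self.readout(h).log_softmax(dim=-1)  # per-bin phoneme log-probabilities

model = PhonemeRNN()
bins = torch.randn(1, 100, N_FEATURES)   # 100 bins of fake neural data (~8 s)
log_probs = model(bins)
print(log_probs.shape)                   # torch.Size([1, 100, 40])
# In the real system, these per-bin probabilities are handed to a language
# model that searches for the most likely sequence of words.
```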
They attempted this real-time, word-level prediction on both a 50-word and a 125,000-word vocabulary, achieving word error rates of 9.1% and 23.8%, respectively. While this is great, the ~24% word error rate on the large vocabulary is not yet sufficient for daily life. Still very impressive, though…
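As a quick aside, word error rate is just the edit distance between the decoded words and the words the participant actually tried to say, divided by the length of the true sentence. A small self-contained example (my own illustration, not the study's code):

```python
# Word error rate: (substitutions + insertions + deletions) / words in reference.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance at the word level.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(word_error_rate("i want to go home", "i want go gnome"))  # 0.4 = 2 errors / 5 words
```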
Want to submit a piece? Or trying to write a piece and struggling? Check out the guides here!
Thank you for reading. Reminder: Byte Sized is open to everyone! Feel free to submit your piece. Please read the guides first though.
All submissions to berkan@usc.edu with the header “Byte Sized Submission” in Word Doc format please. Thank you!