There has been a huge amount of interest in the new PLoS Biology paper from Chang’s lab, in which 15 patients undergoing presurgical cortical mapping were studied while they listened to speech. The signal from the mesh of electrodes was analysed with respect to the original speech signal. In the patients tested with the high density electrode nets (n=4), good correspondences could be mapped between the patterns of cortical responses and the original speech, and a pattern recognition system could be trained to make accurate predictions when the patients were presented with speech items that had not been used for training. These cortical patterns were also used to generate ‘reconstructions’ of the speech. The accuracy of these reconstructions was assessed by correlating them with the original stimuli.
This is a technically astonishing piece of work, both in the decoding of patterns of electrical activity from the superior temporal gyrus and in the resynthesis of speech from these patterns. What it is not, of course, is mind reading or thought decoding. The press coverage often invoked these concepts, but the authors did not ask people just to think of words and then try to decode those. Given the variability in the nature of ‘inner speech’, let alone the relationship of this to thought, that would be one big step further.
There are a few other things I’m itching to know about these data, and one is the extent to which the signal they see is specific to the linguistic aspects of the stimuli rather than to their acoustic patterns. The single word stimuli were recorded by one speaker. Other approaches to exploring the nature of speech processing in the superior temporal gyrus (STG) have used a smaller subset of stimuli (e.g. vowels) and a variety of speakers, so as to allow the big speaker differences to be controlled away (e.g. Jonas Obleser's work). Alternatively, people use a non-speech baseline, like the spectrally rotated speech we and other labs often use (there are some examples here), which allows you to distinguish between neural responses which are common to speech and complex non-speech sounds, and those which are specific to speech. The lack of a baseline means that the paper’s aim to shed light on how we decode speech cortically suffers a little: are we looking at linguistic or purely auditory processing?
The accuracy of the reconstructed speech was assessed by correlating it with the original speech signal. This is good because it gives a value to the accuracy: the higher the correlation, the better the reconstruction. However, it would also be good to run it past human listeners: how much they can decode from the speech and what errors they make would be very informative about the kind of information encoded in the speech reconstructions. It would also rule out the possibility that the correlations are driven by acoustic properties that are not central to the ways that humans process speech: for example, the spectrograms they show contain several spots of high frequency energy just under 7 kHz. This is above the range of frequencies that are critical to speech, and if correlations were driven by this kind of information it could well reflect a reconstruction process driven by purely acoustic, rather than linguistically sensitive, processes. It’s also striking that reconstruction accuracy was lowest for sentences, when speech perception is more efficient for sentences than for single words (perhaps because predictive coding is possible).
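To make the worry about high frequency energy concrete, here is a minimal Python sketch of the kind of check I have in mind. It is not the authors' analysis: the function name, the 4 kHz cutoff and the synthetic "spectrograms" are all assumptions made up for illustration. It compares the correlation over the whole spectrogram with the correlation restricted to a lower frequency band, using toy data in which the only thing the original and the reconstruction share is a block of high frequency energy.

```python
import numpy as np


def spectrogram_correlation(original, reconstructed, freqs=None, max_freq_hz=None):
    """Pearson correlation between two spectrograms (frequency x time arrays).

    If max_freq_hz is given, only frequency rows at or below that value are
    compared. Hypothetical helper for illustration, not the paper's metric.
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    if original.shape != reconstructed.shape:
        raise ValueError("spectrograms must have the same shape")
    if max_freq_hz is not None:
        if freqs is None:
            raise ValueError("freqs is required when max_freq_hz is set")
        keep = np.asarray(freqs) <= max_freq_hz
        original = original[keep]
        reconstructed = reconstructed[keep]
    # Correlate the flattened time-frequency values.
    return float(np.corrcoef(original.ravel(), reconstructed.ravel())[0, 1])


def toy_example(seed=0):
    """Build synthetic data where only high frequency energy is shared."""
    rng = np.random.default_rng(seed)
    freqs = np.linspace(0, 8000, 33)            # 250 Hz steps (assumed)
    original = rng.random((33, 200))            # unrelated random values...
    reconstructed = rng.random((33, 200))       # ...in every band
    high = freqs >= 6500                        # assumed high frequency band
    original[high] += 5.0                       # shared block of high
    reconstructed[high] += 5.0                  # frequency energy
    full_r = spectrogram_correlation(original, reconstructed)
    low_r = spectrogram_correlation(original, reconstructed, freqs, max_freq_hz=4000)
    return full_r, low_r


if __name__ == "__main__":
    full_r, low_r = toy_example()
    print(f"whole spectrogram: r = {full_r:.2f}")
    print(f"below 4 kHz only:  r = {low_r:.2f}")
```

In this toy case the whole-spectrogram correlation comes out high while the correlation below the cutoff sits near zero, even though nothing in the lower band has been reconstructed at all. A band-limited comparison like this would not replace testing with human listeners, but it would show how much of the headline correlation depends on energy outside the range that matters for speech.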
None of this takes away from the many strengths of the paper, and my writing this has been largely driven by a day of trying to explain that the paper was not describing mind reading to various people, none of whom (strangely) wanted to hear me bang on about baselines. Being more systematic with their stimuli and analysis could have made this amazing paper an outstanding one, is all.