Here’s the research setup: A woman speaks Dutch into a microphone, while 11 tiny needles made of platinum and iridium record her brain waves.
The 20-year-old volunteer has epilepsy, and her doctors stuck those 2-millimeter-long bits of metal—each studded with up to 18 electrodes—into the front and left side of her brain in hopes of locating the origin point of her seizures. But that bit of neural micro-acupuncture is also a lucky break for a separate team of researchers because the electrodes are in contact with parts of her brain responsible for the production and articulation of spoken words.
Here’s the cool part. After the woman talks (that’s called “overt speech”), and after a computer algorithmically equates the sounds with the activity in her brain, the researchers ask her to do it again. This time she barely whispers, miming the words with her mouth, tongue, and jaw. That’s “intended speech.” And then she does it all one more time, but without moving at all. The researchers have asked her to merely imagine saying the words.
It’s a version of how people speak, but in reverse. In real life, we formulate silent ideas in one part of our brains, another part turns them into words, and then others control the movement of the mouth, tongue, lips, and larynx, which produce audible sounds in the right frequencies to make speech. Here, the computers let the woman’s mind jump the queue. They registered when she was think-talking (the technical term is “imagined speech”) and were able to play, in real time, an audible signal formed from the interpolated signals coming from her brain. The sounds weren’t intelligible as words, and the work, published at the end of September, is still somewhat preliminary. But the simple fact that the sounds happened at the millisecond-speed of thought and action shows astonishing progress toward an emerging use for brain-computer interfaces: giving a voice to people who cannot speak.
That inability, caused by a neurological disorder or brain injury, is called “anarthria.” It’s debilitating and terrifying, but people do have a few ways to deal with it. Instead of direct speech, people with anarthria might use devices that translate the movement of other body parts into letters or words; even a wink will work. Recently, a brain-computer interface implanted into the cortex of a person with locked-in syndrome allowed them to translate imagined handwriting into an output of 90 characters a minute. Good but not great: at roughly five characters per word, that is about 18 words a minute, and typical spoken-word conversation in English is a relatively blistering 150 words a minute.
The problem is, like moving an arm (or a cursor), the formulation and production of speech is really complicated. It depends on feedback, a 50-millisecond loop between when we say something and hear ourselves saying it. That’s what lets people do real-time quality control on their own speech. For that matter, it’s what lets humans learn to talk in the first place—hearing language, producing sounds, hearing ourselves produce those sounds (via the ear and the auditory cortex, a whole other part of the brain) and comparing what we’re doing with what we’re trying to do.
The best BCIs and computers, though, can take a lot longer than that to go from brain data to producing a sound. The group working with the Dutch-speaking woman did it in just 30 milliseconds. Granted, the sounds their system produced were unintelligible; they didn’t sound like words. If that improves, in theory that loop should be fast enough to provide the feedback that would let a user practice on such a device and learn to use the system better over time, even if they can’t make audible sounds themselves. “We have this super limited data set of just 100 words, and we also had a very short experimental time so we weren’t able to provide her with ample time to practice,” says Christian Herff, a computer scientist at Maastricht University and one of the lead authors of the new paper. “We just wanted to show that if you train on audible speech, you can get something on imagined speech as well.”
Neuroscientists have been working on getting speech signals out of people’s brains for at least 20 years. As they learned more about how speech originates in the brain, they’ve used electrodes and imaging to scan what the brain did while a person was speaking. They’ve had incremental successes, getting data that they could turn into the sounds of vowels and consonants. But it isn’t easy. “Imagined speech, in particular, is a hard thing to study and a hard thing to get a good grasp on,” says Ciaran Cooney, a BCI researcher at Ulster University who works on speech synthesis. “There’s an interesting debate there because we have to figure out how close the relationship between imagined speech and overt speech is if we’re going to be using overt speech to validate it.”
It’s tricky to interpolate only signals from the parts of the brain that formulate speech—most notably the inferior frontal gyrus. (If you stuck a knitting needle straight through your skull just above your temple, you’d poke it. [Don’t.]) Imagined speech isn’t just your mind wandering, or your interior monologue; it’s probably more like what you hear in your mind’s ear when you’re trying to think of what to say. The way the brain does that may be different—syntactically, phonologically, in its pacing—from what actually comes out of your mouth. Different people might encode information in those parts of the brain idiosyncratically. Also, before the mouth does any work, whatever the language parts of the brain have sorted out has to make its way to the premotor and motor cortices, which control physical movement. If you’re trying to build a system to be used by people who can’t speak, they don’t have their own words to aim for, to validate that the system is synthesizing what they want to say. Every BCI-assisted prosthetic requires that kind of validation and training. “The problem with imagined speech is that we don’t have an observable outcome,” Herff says.
In 2019, a team based at UC San Francisco came up with an elegant workaround. They asked their subjects to speak and recorded signals from not only the parts of the brain responsible for coming up with words—the inferior frontal cortex—but also the regions that control the movement of the mouth, tongue, jaw, and so on. That’s the ventral sensorimotor cortex, sort of up and back from where you did not stick in that knitting needle. The team built a machine learning system that could turn those signals into a virtual version of the mechanical movements of speech. It could synthesize intelligible words, but not in real time. This approach is called an open-loop system.
Led by UCSF neuroscientist Eddie Chang, that team—scientific competitors to the team working with the Dutch-speaking woman, and with funding from the company that used to be called Facebook—has since published another striking success. In July, they showed how they’d embedded electrodes in and around the cortical speech centers of a person rendered speechless after a stroke. After a year and a half of training, they had a system that could pick up the intention to say any of 50 words. With the help of an algorithm that could predict which ones were most likely to follow others, it let the person speak, via a speech synthesizer, eight-word sentences at about 12 words per minute. It was the first real test of how well a person with anarthria could use a system like this. The resulting synthetic speech still wasn’t in real time, but better computers mean faster turnaround. “We were able to use his mimed, whispered signals to produce, and to decode the language output,” says Gopala Anumanchipalli, a computer and neural engineer at UCSF and UC Berkeley who worked on the research. “And we are right now in the process of generating speech, in real time, for that subject.”
That approach, focusing on a 50-word lexicon, gave the Chang team’s work better accuracy and intelligibility. But it has some limitations. Without a feedback loop, the user can’t correct a word choice if the computer gets it wrong. And it took 81 weeks for the person to learn to produce those 50 words. Imagine how long it’d take to get to 1,000. “The more words you add to that system, the more the problem becomes untenable,” says Frank Guenther, a speech neuroscientist at Boston University who didn’t work on the project. “If you go to 100 words, it gets much harder to decode each word, and the number of combinations gets much higher, so it’s harder to predict. A full vocabulary, most people use thousands of words, not 50.”
The point of trying to build a real-time system like the one Herff’s group is trying to put together—a “closed loop”—is to let users eventually make not words but sounds. Phonemes like “oh” or “hh,” or even syllables or vowel sounds, are the atomic units of speech. Assemble a library of neural correlates for those that a machine can understand, and a user should be able to make as many words as they want. Theoretically. Guenther was on a team that in 2009 used a BCI implanted in the motor cortex of a person with locked-in syndrome to give them the ability to produce vowel sounds (but not complete words) with just a 50-millisecond delay, good enough to improve their accuracy over time. “The idea behind a closed-loop system was to just give them the ability to create acoustics that could be used to produce any sound,” Guenther says. “On the other hand, a 50-word system would be much better than the current situation if it worked very reliably, and Chang’s team is much closer to the reliable decoding end of things than anyone else.”
The endgame, probably half a decade away, will be some unification of accuracy and intelligibility with real-time audio. “That’s the common direction all of the groups doing this are going toward—doing it in real time,” Anumanchipalli says.
Bigger and better electrode arrays might help. That’s what Meta, formerly Facebook, is interested in. So is Elon Musk’s company Neuralink. More data from the speech-forming areas of the brain might help with making synthetic phonemes intelligible in real time and determining whether every person’s brain does this work in roughly the same way. If so, that’ll make the training process on individual BCIs easier because every system will start with the same baseline. That would make the learning process into something more akin to seeing a cursor move in the right direction and figuring out (through biofeedback processes that no one really understands yet) how to do it better and more reliably.
But if that’s not possible, better algorithms for understanding and predicting what a brain is trying to do will get more important. Purpose-built electrode arrays placed, neurosurgically, in the exact right place for speech would be great, but current research ethics rules mean that “this is very difficult in Europe,” Herff says. “So currently our focus is on using a more complex algorithm that is capable of higher-quality speech, and really focusing on the training aspect.”
Anumanchipalli’s group is converging on that target. Present-day BCIs approved for human use don’t have enough electrodes to get all the data researchers would like, though many hope future tech like Neuralink will improve on that. “It’s safe to say that we’ll always be sparse in our sampling of the brain,” he says. “So whatever the residual burden is, it has to be algorithmically compensated.” That means getting better at gathering intent, “how best to create a protocol where the subject is learning from the system and the system is learning from the subject.” That speech synthesizer of the future might take input from all kinds of other biometric streams besides electrodes in the brain—Anumanchipalli says that might include other indicators of intent or desire, like movement or even heart rate. And any new system will have to be easy enough to learn and use so that a user won’t give up on it out of fatigue or frustration. “I think we are very close. We have all these proofs of principles now,” Anumanchipalli says. “Progress has been slow, but I think we’re zeroing in on the right approach.” Imagined speech might not be imaginary forever.
Updated 11/10/2021 3:20 ET: A previous version of this story quoted Gopala Anumanchipalli saying research subjects “mind-whispered” words. He said the words were “mimed” and whispered.