Can synthesized speech be generated from brain activity?
You're undoubtedly familiar with the late astrophysicist Stephen Hawking. Although his condition left him unable to speak, he could communicate through a device, using eye and facial movements to compose each word, letter by letter. New research is paving the way for faster speech-synthesis technology that more closely reproduces the natural flow of speech. So how does this “brain decoder” work?
Though we’re largely unaware of it, speaking requires very precise, multidimensional coordination and control of the vocal tract’s articulatory muscles, which extend from the glottis to the lips. Speaking relies on a set of complex, simultaneous, and fluid movements that correspond to activity in the brain. In this research, led by G.K. Anumanchipalli from the Department of Neurological Surgery at the University of California, San Francisco, the first step was to build a genuine brain map linking sounds to vocal tract anatomy. To do so, the scientists recorded and analyzed brain activity from five participants (epilepsy patients) while they read about a hundred sentences aloud. In parallel, the researchers tracked the movements of the patients’ vocal tracts (lips, tongue, jaw, larynx). From these observations, the researchers identified the brain signals that coordinate sound articulation and associated them with the movements needed for pronunciation. From there, they reasoned: “If these speech centers in the brain are encoding movements rather than sounds, we should try to do the same in decoding those signals.”
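To give a rough sense of what such a brain-to-movement mapping involves, the toy sketch below fits a simple linear (ridge-regression) map from simulated cortical recordings to articulatory trajectories. This is only an illustration of the general idea, not the study's actual method: the real decoder used recurrent neural networks trained on genuine electrode recordings, and every array shape and variable name here is a hypothetical placeholder.

```python
# Toy sketch (not the study's method): a linear "stage 1" decoder mapping
# simulated cortical activity to articulatory (vocal tract) movements.
# All data, shapes, and names below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 2000      # time points of recorded neural activity
n_electrodes = 64     # cortical recording channels (assumed number)
n_kinematics = 33     # articulatory features: lip, tongue, jaw, larynx traces (assumed)

# Synthetic "cortical activity" and the articulatory movements it encodes.
X = rng.standard_normal((n_samples, n_electrodes))
true_map = rng.standard_normal((n_electrodes, n_kinematics))
Y = X @ true_map + 0.1 * rng.standard_normal((n_samples, n_kinematics))

# Fit a ridge-regularized linear map: neural activity -> vocal tract movement.
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(n_electrodes), X.T @ Y)

# Decode articulatory trajectories for new neural data.
X_new = rng.standard_normal((10, n_electrodes))
decoded_kinematics = X_new @ W
print(decoded_kinematics.shape)  # (10, 33): one articulatory vector per time step
```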
From there, the scientists set about building a neural decoder that uses the representations encoded in cortical activity to produce audible speech (spoken by a synthetic voice). They used their initial observations to reverse the process described above: generating speech from brain activity using an algorithm. The resulting audio files were made public and, in closed-vocabulary tests, listeners were readily able to identify and transcribe the speech synthesized from cortical activity. Hundreds of listeners evaluated 101 synthesized sentences: choosing from a list of 25 options, they correctly identified about 70% of the words and perfectly transcribed 43% of the sentences. When the list of choices was doubled (50 options), listeners still correctly identified more than 47% of the words but perfectly transcribed only 21% of the sentences.
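To make the closed-vocabulary listening test concrete, the sketch below shows how such a test can be scored: for each synthesized word a listener picks one option from a fixed list (25 or 50 choices in the study), and accuracy is simply the fraction of correct picks. The words, listener responses, and helper function are invented for illustration and do not come from the study.

```python
# Illustrative scoring of a closed-vocabulary listening test.
# The word lists and listener responses below are hypothetical.
def closed_vocab_accuracy(true_words, listener_choices):
    """Fraction of synthesized words the listener identified correctly."""
    correct = sum(t == c for t, c in zip(true_words, listener_choices))
    return correct / len(true_words)

true_words       = ["ship", "garden", "window", "river"]
listener_choices = ["ship", "garden", "winter", "river"]  # one mistake

print(f"{closed_vocab_accuracy(true_words, listener_choices):.0%}")  # 75%
```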
G.K. Anumanchipalli and his colleagues also asked one of the participants to speak the sentences aloud and then to mime them (mouthing them silently, without producing any sound). The test proved conclusive: the decoder was able to synthesize speech from the silently mimed sentences.
The authors acknowledge that there is still a long way to go before synthesized speech will perfectly mimic spoken language, stating: “We’re quite good at synthesizing slower speech sounds like ‘sh’ and ‘z’ as well as maintaining the rhythms and intonations of speech and the speaker’s gender and identity, but some of the more abrupt sounds like ‘b’s and ‘p’s get a bit fuzzy.”
Still, this new neural speech prosthesis generates speech at rates much closer to natural speaking (about 150 words per minute) and represents a serious prospect for restoring spoken communication in people who have lost their voices. “We are hopeful that one day people with speech disabilities will be able to learn to speak again using this brain-controlled artificial vocal tract,” says study co-author Josh Chartier.
Source: Gopala K. Anumanchipalli, Josh Chartier, Edward F. Chang, “Speech synthesis from neural decoding of spoken sentences,” Nature, April 2019 // University of California, San Francisco website: “Synthetic Speech Generated from Brain Recordings”