Just Say “Ah”: Improvements in Voice Synthesis

On stage, Dr. Rupal Patel is a commanding presence. She speaks clearly and passionately about her work. Patel is an associate professor in the Department of Speech Language Pathology and Audiology at Northeastern University and the creator of the VocaliD project. She leads a team of researchers that is developing a system to create personalized synthetic voices for people with speech impairments, called target talkers by the folks at VocaliD. Rather than continue with the limited options for synthetic voices currently available, Patel imagines a world where people in need of synthetic voices can have them tailored to the qualities of their own speech.

Synthetic voices aren’t exactly new to science, but the technology hasn’t progressed much in the past several decades. While a voice can now be played from a smartphone rather than a separate, bulky machine, the options for voices remain much the same. The most common voice (nicknamed Perfect Paul) is used because it is easy to understand in a loud environment; this is the voice Stephen Hawking uses. It may also be the voice a speech-impaired teenage girl uses, even though it is recognizably the voice of an adult male. Patel’s research focuses on combining a sample of a target talker’s voice with sounds derived from a voice donor. This allows her to build a voice that has the clarity, breathiness and pitch of the target talker’s voice, but with the articulation of the donor’s speech.

Patel can use as little as one vowel sound from the target talker and a few hours of speech from a donor to build a synthetic voice that matches the target talker’s voice better than any of the currently available synthetic voices. The donor is recorded reading a few hours of sentences so that every possible sound combination appears at least once in the sound database for the new voice. Next, characteristics of the target talker’s voice are added to the program. The program can then produce articulate speech that sounds like the target talker’s voice. Patel also helped found The Human Voicebank Initiative, a growing body of speech recordings intended to support the development of better synthetic voices and other speech pathology research.

Voices built to suit the people using them would direct a listener’s attention to what the individual is saying, rather than distracting the listener by not matching the speaker’s appearance. While a young girl speaking with an adult woman’s voice seems discordant, a girl whose voice sounds the right age hardly registers. There are intangible benefits as well. Patel closes her TED talk (see Patel’s video on the home page) with a brief anecdote about the first little boy for whom she built a voice; she tells the audience that one of the first things he said with his new voice was, “Never heard me before.” With that, she bows offstage, allowing the power of the boy’s message to demonstrate her own.

Did You Know?

The earliest speech synthesizers, built in the late 1700s, were mechanical. A Viennese man, Wolfgang von Kempelen, built a synthesizer that produced sounds by using a bellows to push air through a tube with a reed at the end, like a clarinet. He attached a flexible leather tube to the end of that, and by changing the shape of the leather with his hand, he could change the sounds produced. Since the shape of the leather, acting like a person’s lips and mouth, altered the sound his machine made, Kempelen’s work showed that it was the vocal tract that controlled speech, and not the larynx as had been believed previously.