How Baidu Created 'DeepVoice,' Its Incredible Text-to-Speech Algorithm

When future androids possess human-sounding voices, know that the first steps are being taken today. Voice synthesis is one of the most intensive tasks in modern computing: to sound realistic, it needs to produce audio at about 48 kilohertz, which forces it to decide what sound to play next every 20 microseconds or so. Google’s DeepMind development house has been the world’s best, but even it couldn’t approach the speed needed to turn text into realistic sound in real time.

Now, Baidu, the Chinese search giant — China’s Google, essentially — has announced an innovation from its A.I. development house: a real-time speech synthesis algorithm that seems to blow DeepMind’s approach out of the water.

Called DeepVoice, its new approach is capable of creating near-human sound quality in real time — more than 400 times faster than DeepMind’s previous best.

The result is a machine that takes text and produces audio, but the published sample comes with a disclaimer: “The Baidu sample had the chance to train on a ground recording of someone saying that sentence which gives it a far more human-like quality.” That said, the result is still incredible.

As the name implies, Baidu’s achievement originates with DeepMind and its seminal paper on the WaveNet speech synthesis program. Some older approaches to text-to-speech (TTS) simply jammed together warped versions of prerecorded phonemes (the basic units of sound that make up the words of a particular language), but it was WaveNet that made the first real achievement in generating all-new audio according to specific rules.

Baidu’s innovation was to throw out every part of the WaveNet pipeline not already based on the machine learning approach. When faced with a sentence in written text, DeepVoice first identifies the phonemes, and then predicts the pitch and duration of each, and generates the sound accordingly. That’s precisely the way DeepMind was already doing it, but by removing the classical algorithms tasked with some of those steps and replacing them with self-trained deep learning algorithms, the Chinese search giant was able to achieve an even more human sound in a much shorter span of time.
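The three stages described above (identify the phonemes, predict pitch and duration, generate the sound) can be sketched as a toy pipeline. This is only an illustration of how the stages hand data to one another, not Baidu's or DeepMind's implementation: the tiny lexicon, the fixed pitch and duration rules, and the sine-wave "voice" are all hypothetical stand-ins for what would, in DeepVoice, be trained neural networks.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE_HZ = 48_000  # sampling rate quoted in the article

# Hypothetical toy lexicon (ARPAbet-style symbols). A real system would
# learn the text-to-phoneme mapping rather than look it up.
TOY_LEXICON = {"hi": ["HH", "AY"], "there": ["DH", "EH", "R"]}


@dataclass
class PhonemeSpec:
    phoneme: str
    duration_s: float  # how long the phoneme lasts
    pitch_hz: float    # fundamental frequency to speak it at


def text_to_phonemes(text: str) -> list[str]:
    """Stage 1: map words to phonemes (a lookup stands in for a model)."""
    phonemes: list[str] = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, []))  # unknown words are skipped
    return phonemes


def predict_prosody(phonemes: list[str]) -> list[PhonemeSpec]:
    """Stage 2: assign duration and pitch (fixed rules stand in for a model)."""
    return [
        PhonemeSpec(p, duration_s=0.08, pitch_hz=120.0 + 10.0 * i)
        for i, p in enumerate(phonemes)
    ]


def synthesize(specs: list[PhonemeSpec]) -> list[float]:
    """Stage 3: generate samples (a sine tone stands in for the audio model)."""
    samples: list[float] = []
    for spec in specs:
        n = round(spec.duration_s * SAMPLE_RATE_HZ)
        samples.extend(
            math.sin(2 * math.pi * spec.pitch_hz * t / SAMPLE_RATE_HZ)
            for t in range(n)
        )
    return samples


audio = synthesize(predict_prosody(text_to_phonemes("hi there")))
print(f"{len(audio)} samples, {len(audio) / SAMPLE_RATE_HZ:.2f} s of audio")
```

Because each stage only consumes the previous stage's output, any one of them can be swapped for a learned model without touching the others, which is the kind of substitution the article describes.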

How much shorter? As mentioned, at about 48 kilohertz DeepVoice has roughly 20 microseconds to pick, tailor, generate, and play each portion of the overall sound, and it has to do so about 48,000 times each second. WaveNet previously needed minutes to generate just a single second of audio, while DeepVoice needs, potentially, just that one second itself.
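The arithmetic behind those figures is simple enough to check. The sketch below computes the per-sample time budget at 48 kilohertz and, taking the article's "400 times faster" figure at face value and assuming DeepVoice runs at exactly real time, what that would imply for the slower system. The function names are illustrative only.

```python
SAMPLE_RATE_HZ = 48_000  # sampling rate quoted in the article


def per_sample_budget_us(sample_rate_hz: int) -> float:
    """Time available to produce one audio sample, in microseconds."""
    return 1_000_000 / sample_rate_hz


def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """Seconds of compute per second of audio; 1.0 or less means real time."""
    return compute_seconds / audio_seconds


budget_us = per_sample_budget_us(SAMPLE_RATE_HZ)
print(f"{budget_us:.1f} microseconds per sample")  # 20.8

# Assumption: DeepVoice at exactly real time (1 s of compute per 1 s of audio)
# and the earlier system 400 times slower, per the article's figure.
slow_compute_s = 400 * 1.0
print(f"{slow_compute_s / 60:.1f} minutes per second of audio")  # 6.7
```

Under those assumptions the older approach would need several minutes per second of audio, which is consistent with the "minutes" quoted above.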

Since the pitch and duration of each sound can be adjusted, the apparent tone and emotion of a line can be tailored as well. Right now, there is no sentiment analysis going on as part of speech synthesis, but until very recently there was no need for it. With this level of fidelity available, the next big step forward might very well be in identifying that some paragraphs should sound sad, and others angry. With the ability to fluidly change accents, such a system might one day be able to read in different voices for different characters in an e-book.

Such ideas are only conceivable, however, because of Baidu’s biggest advantage of all: adaptability. By achieving every step in the process with machine learning, the researchers largely removed their own human limitations from the process of updating the system’s understanding of language. While it takes weeks of work to tailor a traditional TTS algorithm so it can speak in a new voice, DeepVoice can learn everything on its own, at the speed of a computer. Given only the appropriate training materials, new accents and even languages could be only hours away. The company is working on other A.I. projects, too, including self-driving cars.

Also, remember that voice command programs work by first generating a text version of the spoken sentence — meaning that by combining voice recognition, translation, and speech synthesis, it might not be long before it’s possible to travel to a foreign country with a real-time robot interpreter on your lapel.

You’d still need a data connection for the foreseeable future, since even DeepVoice can’t run in real time on a smartphone, but given access to a remote server, this is one of those innovations that could spread to truly change how we interact with the world.