
DeepMind A.I. Bridges the Gap Between Robot and Human Voices

Sep. 9, 2016


Artificial intelligence just made robot voices sound a lot more realistic.

DeepMind, which previously demonstrated the power of A.I. by beating a top human player at Go in March and by helping Google cut the energy used to cool its data centers by up to 40 percent in July, is now focused on speech synthesis.

The A.I. research group, which is part of Google parent company Alphabet, revealed this morning that it has created a new technology called WaveNet that can generate speech, music, and other sounds more realistically than earlier methods.

DeepMind explains that many existing speech synthesis methods rely on concatenation, in which “a very large database of short speech fragments are recorded from a single speaker and then recombined to form complete utterances.” WaveNet, on the other hand, directly models the “raw waveform of the audio signal” to create more realistic voices and sounds.

[Figure: DeepMind explains how a WaveNet is structured. Credit: DeepMind]

This means that WaveNet builds the audio signal itself, one sample at a time, instead of stitching together recorded syllables or entire words. Generating audio this way is a “computationally expensive” process, but one that DeepMind has found “essential for generating complex, realistic-sounding audio” with machines.
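To make the idea of building audio from the raw waveform concrete, here is a minimal Python sketch of sample-by-sample generation. It is not DeepMind's code and does not use a neural network: `toy_next_sample_weights` is a hypothetical stand-in for a trained model, and the sample rate, number of amplitude levels, and 220 Hz tone are illustrative assumptions. What it shows is only the loop itself, in which each new sample is drawn based on the samples that came before it and then fed back in as history.

```python
import math
import random

SAMPLE_RATE = 16_000   # samples per second (assumed; a common rate for speech)
LEVELS = 256           # each sample is one of 256 discrete amplitude values (assumed)
TONE_HZ = 220.0        # pitch the toy predictor tends toward (arbitrary choice)
CENTRE = (LEVELS - 1) / 2
OMEGA = 2 * math.pi * TONE_HZ / SAMPLE_RATE


def toy_next_sample_weights(history):
    """Hypothetical stand-in for a trained network.

    Given the samples generated so far, return one non-negative weight per
    amplitude level. The weights peak at the value a pure sine wave would
    take next, judged from the last two samples.
    """
    prev1 = history[-1] - CENTRE
    prev2 = history[-2] - CENTRE
    predicted = 2 * math.cos(OMEGA) * prev1 - prev2 + CENTRE
    predicted = min(max(predicted, 0), LEVELS - 1)  # keep inside the valid range
    return [math.exp(-0.5 * (level - predicted) ** 2) for level in range(LEVELS)]


def generate(num_samples, seed=0):
    """Generate audio one sample at a time, feeding each output back in."""
    rng = random.Random(seed)
    # Two seed samples so the predictor has some history to look at.
    samples = [round(CENTRE), round(CENTRE + 100 * math.sin(OMEGA))]
    while len(samples) < num_samples:
        weights = toy_next_sample_weights(samples)
        next_sample = rng.choices(range(LEVELS), weights=weights, k=1)[0]
        samples.append(next_sample)  # the new sample becomes part of the history
    return samples[:num_samples]


if __name__ == "__main__":
    audio = generate(SAMPLE_RATE // 100)  # 10 milliseconds of audio
    print(len(audio), "samples; first ten:", audio[:10])
```

Even in this toy version, one second of audio means running the predictor 16,000 times in sequence, which gives a sense of why DeepMind calls the approach computationally expensive.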

The result of all that extra work is that WaveNet closes the gap between the best existing text-to-speech systems and human speech by more than 50 percent in both U.S. English and Mandarin Chinese. Here’s an example of speech generated using parametric text-to-speech, a method in common use today, which DeepMind uses to demonstrate where existing speech synthesis falls short:

And here’s an example of the same sentence generated by WaveNet:

As companies continue their work on natural language interfaces, offering more realistic-sounding responses is going to become increasingly important. WaveNet can help solve that problem.
