“Read my lips.” It’s a common plea made whenever someone just isn’t getting the message — and now it’s a command that the DeepMind A.I. can obey, because Google and the University of Oxford have taught the system how to recognize words and phrases based on lip movements.
DeepMind was fed “over 100,000 natural sentences from British television,” the researchers said, taken from a dataset commonly used to train A.I. on speech recognition and machine translation. Analyzing this data allowed the deep neural networks at the A.I.’s core to learn how to “transcribe videos of mouth motion to characters,” which in turn enabled it to understand what people were saying without relying on audio.
The system was eventually able to beat a professional lip reader at recognizing speech from BBC clips, which means it’s probably much better at lip-reading than most people. The team said these abilities could allow A.I. to understand commands even in loud environments, create subtitles for silent films, and differentiate between multiple speakers, among other things.
Being able to read lips could also help A.I. recognize speech even when audio is available. The two systems could play off each other: A machine could lip-read if there’s a problem with the audio, for example, and then use more traditional speech recognition tools if the video has an issue. The resulting A.I. would be far easier to communicate with and wouldn’t be limited to just one input method (voice) like today’s digital assistants.
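To make that back-and-forth concrete, here is a minimal Python sketch of how a system might choose between the two inputs. It is an illustration only, not how DeepMind or any shipping assistant works: the `Hypothesis` class, the `pick_transcript` function, the confidence scores, and the 0.5 cutoff are all hypothetical names and values invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    """One recognizer's guess at what was said (hypothetical structure)."""
    text: str
    confidence: float  # assumed to range from 0.0 (no idea) to 1.0 (certain)


def pick_transcript(
    audio: Optional[Hypothesis],
    video: Optional[Hypothesis],
    min_confidence: float = 0.5,  # arbitrary cutoff chosen for illustration
) -> Optional[str]:
    """Return the text of the most confident usable guess, or None.

    A guess is unusable if its input was missing entirely (None) or if its
    confidence falls below the cutoff, e.g. noisy audio or a blocked camera.
    """
    usable = [
        h for h in (audio, video)
        if h is not None and h.confidence >= min_confidence
    ]
    if not usable:
        return None
    return max(usable, key=lambda h: h.confidence).text
```

With noisy audio and a clear view of the speaker, `pick_transcript(Hypothesis("turn on the fights", 0.2), Hypothesis("turn on the lights", 0.8))` falls back to the lip-read guess. If the camera feed is missing altogether, passing `None` for `video` leaves the audio guess as the only candidate.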
More traditional advancements in speech recognition will also help. Microsoft, for instance, announced in October that it had taught an A.I. to recognize speech better than humans in a test in which participants had to transcribe phone calls. (The A.I. was half a percent more accurate, which is notable even if it isn’t quite earth-shattering.)
Google has also been working to make A.I. more capable in other respects. DeepMind can now synthesize speech that sounds more human by using waveforms from raw audio instead of mixing and matching a bunch of pre-recorded clips, as current systems do. This means A.I. is getting better at speaking, too, not just listening.
On the visual end of things, DeepMind is learning more than just how to read lips. It’s also been taught how to categorize images after seeing just one example instead of having to analyze millions of images. This technique, called one-shot learning, brings A.I. even closer to learning and thinking in ways that are similar to the human brain’s.
All of these changes make A.I. smarter, which could also make it more convenient. Instead of making us shout at Siri, for example, the assistant could just read our lips. Digital assistants could also get better at normal speech recognition, like Microsoft’s system, or at identifying something in a photo on their own instead of searching the web. These improvements aren’t made for their own sake; we can directly benefit from them.
Photos via Google and the University of Oxford, Getty Images / Miles Willis