Adobe's "VoCo" Can Put Words in Your Mouth

November 7, 2016

Adobe, arguably best known for Photoshop, is pushing further into audio editing, beyond its Audition program.

Adobe’s Zeyu Jin demoed VoCo, which in essence is Photoshop for audio, on Friday at Adobe MAX 2016. In a nutshell, VoCo lets users quickly rearrange words in recorded speech, and even type in new words and phrases. When the finished speech plays back, it won’t sound like a robot: It’ll sound like the speaker, because VoCo has learned how the speaker talks.

In the future, when this product hits the market, someone could make you — or anyone, for that matter — say just about anything. So, in a year or so, when your boss asks you why she heard a five-minute clip in which you harped on about the health benefits of crack cocaine and crystal meth, simply blame Adobe.

The ability to rearrange words in a speech is nothing new. Adobe just made the process exceptionally easy: Now it’s as simple as cutting a word from a sentence, then pasting it elsewhere in that sentence. A friendly user might just have fun Yodafying her friends (“Go to Applebee’s tomorrow shall we?”), or fixing a blunder in her podcast. But an unfriendly user could, for example, take a politician’s speech, manipulate it into a rabble-rousing tirade, and further divide our splintered nation. (Then again, that might be redundant.)

Adobe says it’s working on a preventative measure against such nefarious uses of its VoCo tech, a sort of audio watermark like those you might see on stock photos, but only time will tell how long such a safeguard holds up.

Did Trump say that, or did a computer-generated imitation say it?

Getty Images / Chip Somodevilla

The system needs just 20 minutes of a person’s speech to understand how to replicate it. After that, the Audioshopper (if you will) need only type in what he wishes his speaker to say. As the video below demonstrates, at about four-and-a-half minutes, the simulation is impressive: While there may be some hint of a robot in there, it’s far more human than you’d presume. (Thanks, no doubt, to advances in natural language understanding.) And since VoCo is, for now, just a demo, there’s no telling how much better it will get. Perhaps soon, it will function with just five minutes of speech, and there will be no way to tell its output from the real thing. A bit like what Photoshop did to photography, and to photographers.

A future in which it’s possible to make anyone say anything is a little bit daunting, but, with Adobe as our guide, it’s where we’re headed. We must already be wary of manipulated facts and doctored photos; now, we can add audio recordings to the list of things we can no longer trust. But all is not lost: VoCo is sure to be fun, too. For instance: You could make Drake admit, in speech, that he Photoshopped his album cover.
