I had Google AI narrate my audiobook. The results were... not terrible.
The production of audiobooks can be a long, laborious process. Tech giants are trying to cut human narrators out of the equation.
My heart sank when I listened to the audio version of my first book, YouTubers, a study of YouTube and its impact on society.
Just a few words into the first chapter, the narrator — a bland North American actor whose voice sounded nothing like mine, which was odd for me to hear — read out a section in which I’d been mentioned in a video by YouTuber Casey Neistat.
Neistat, who’s likely to appear on YouTube influencers’ Mount Rushmore, has a surname pronounced N’eye-stat. But the narrator of my book said Nee-stat. It was a catastrophic error that I felt undercut any credibility I had, and there was nothing I could do about it because by the time I was hearing it, the audiobook was already on Amazon via Audible.
I feared that listeners with any knowledge of YouTube would hear this mistake, hit stop, and never listen to the rest, and I’ve consequently never really publicized the audiobook version.
So when my publisher contacted me last month about the prospect of producing an audio version of my second book, TikTok Boom, using a whizzy AI algorithm developed by Google, I was circumspect. If a human couldn’t get Casey Neistat right, what hope did an AI have with Yiming Zhang (the name of ByteDance’s founder) or Toutiao (the company’s Chinese news app)?
Audiobooks have become an increasingly important part of the publishing industry. In 2010, they accounted for two cents of every dollar the industry made; by 2020, it was nearly a dime a dollar. That’s not the whole story, however: Because audiobooks generally retail for less than print books, looking at revenue undercounts their position in the market.
The Publishers Association, a U.K. industry body, has spoken of a “steep rise” in audiobooks’ popularity accelerated by the pandemic. In 2020, 71,000 titles were released in audiobook format, up 39 percent from just a year earlier. “Audio is a very exciting market for book publishers,” says Martin Hickman, founder of Canbury Press, which published both of my books. “It’s a way of increasing revenue from already published titles.”
Patch McQuaid, founder of iD Audio, which produces audiobooks for major publishers, says the audiobook trend isn’t just driven by publishers. Consumers, he says, “expect there to be an audio version of pretty much everything that’s released.”
It makes sense to have an audiobook. Yet audiobooks can be expensive to produce. McQuaid says that basic audiobooks start at £2,000 ($2,400) for production alone, before the actor’s fees are included; those can run anywhere from £75 to £600 an hour, depending on the stature of the person hired. Voice actor and audiobook narrator Cromerty York tells me she effectively won’t get out of bed for less than $120 an hour. (And audiobooks require six hours of preparation on the part of the narrator for every one hour of recorded audio.)
Three of the big tech giants have either publicly said they are — or are believed to be — developing automated audiobook-reading technology. Apple and Amazon have both worked with narrators to develop voice corpuses, according to industry sources, while Google’s offering is currently available as a free beta test. With Google’s service, you can choose from more than 35 different narrator voices, submit an .EPUB file, then receive back audio tracks that can be fine-tuned.
But whether it’s any good is another question. Done right, voice actors imbue an audiobook with a sense of narrative and emotion that complements the written word. It’s a performance that is designed to enthrall.
Steven Jay Cohen, the creative director of South Hadley, Massachusetts–based audiobook production company Spoken Realms, has been an audiobook narrator for more than 30 years. “In all three cases, what’s going on is not teaching a computer how to narrate,” Cohen says, referring to the tech giants’ efforts. “It’s listening to different kinds of things, figuring out how often certain kinds of intonation changes happen in a human voice, then trying to turn that into an equation so that a computer can likely make an appropriate choice.”
To Cohen, it’s rebranded, souped-up text-to-speech software — several steps above what has been around for eons. “My Commodore 64 could do text-to-speech when I was a kid,” he says. “It’s just now they don’t sound like they’re saying, ‘Greetings, Professor Falken. Shall we play a game?’” (He’s referring to the computerized voice in the 1983 movie WarGames, in case you didn’t get the reference.)
McQuaid raises several other issues with this new narration tech. For one thing, he says that what’s marketed as a time- and money-saver may not actually be that. While the rough draft of the AI-generated voice track may be fairly good, the time spent cleaning it up to make it presentable may be more hassle than it’s worth.
McQuaid does video game voice acting and sees AI-generated voices being useful there, because games tend to involve short snippets of dialog. “I don’t want to be seen as a Luddite,” he says. “I do like new technology and stuff. However, there’s a whole thing about what an actor brings to what they do.”
But human beings are expensive to hire. “For publishers, that creates a problem, because the audiobook market is so far largely concentrated in the hands of Audible, which is a branch of Amazon,” says Hickman. And Amazon isn’t known for paying huge returns to publishers. Audiobooks make up between 10 and 15 percent of Canbury Press’ revenue per book. “For a lot of books, it doesn’t really make sense to me to spend that £1,000-plus creating an audiobook,” he says. “I won’t get it back in revenue.”
Hickman says the process of automatically generating narration through Google for my audiobook was relatively simple. He selected a voice type, dumped the text into Google’s website, and waited a few hours. What came back was an audiobook, narrated in a plummy British accent, that sounded decent while lacking some of the emotion and drama you’d hope for.
The AI made a passable attempt at Yiming Zhang’s name, and with clearer diction than anything I’d be able to attempt. Yet it struggled with some things. I have a predilection for em-dashes in my writing, which I aim to deploy as a conversational U-turn, taking us in one direction — or supporting a point I’ve already made. Google’s AI reader steamrolls right through them, ignoring the dash altogether. It also just keeps reading at a steady tempo, when good audiobook narrators play with pace, lingering on some scenes and racing through others.
The key question I had for Hickman: Was the automatically generated audiobook good enough to go on sale to the public? “I think it’s good enough to go on sale,” he replies. “I think it ultimately will work very well, but at the moment I’m reluctant to put the rest of the back catalog into the technology as it is just now. I’d like to wait and see what people think about the TikTok book.”
And because the Google AI isn’t programmed to self-promote (yet), I will do the dirty work: The audiobook of TikTok Boom is available for download now.