
Google Tone/Transfer turns any audio into an instrument

"Deconstruct sound into playdough."


Oct. 2, 2020

If you're in the mood for a fun way to kill some time and learn a bit about machine learning and artificial intelligence in the process, Alphabet's Google has an audio treat for you. The company's Tone/Transfer is a website where you can upload an audio recording of practically anything under the sun, from your voice to pots and pans clanging together to any other aural detritus you choose. A machine learning model then transforms that input into virtual instruments like violins, trumpets, saxophones, and flutes... or, at least, close approximations of them.

Google frequently experiments with machine learning, exploring how neural networks learn from different datasets and how they interpret new input. The results for Tone/Transfer depend on the input, obviously. If you're recording something melodious, the output is likely to be equally pleasant to listen to. I, however, decided to take things in a different direction. I recorded myself singing the English alphabet, and Tone/Transfer transformed it into a trumpet (per my request). The website notes, "This [trumpet] model recreates the tone of the trumpet mixed with the player’s breathing sounds." Those breathing sounds make the result realistic, and hilarious. Check it out:

It gets better. Here's how the neural network interpreted my clapping and turned it into a violin tune.

How it works — Head over to Tone/Transfer and record yourself either through the website or through an Android phone. Once the recording is complete, you will have the option to transform your audio into a flute, violin, trumpet, or saxophone. If you click on each instrument, Google provides additional information about the training data involved. It's entertaining and educational.

In this blog post, you can learn more about the neural network behind Tone/Transfer. Rather than generating raw audio from scratch, the model extracts simple, interpretable signals from your recording, such as its pitch and loudness, and uses them to drive Digital Signal Processing (DSP) components like oscillators and filters, which produce the actual sound. Each instrument's model is trained on recordings of that instrument, so the same pitch and loudness contours come out sounding like a trumpet, a violin, and so on. That's how my voice singing the English alphabet and my clapping in the other recording were turned into trumpet and violin sequences, respectively.
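To make the DSP half of that description a little more concrete, here is a minimal Python sketch of a harmonic (additive) synthesizer, the kind of oscillator this approach can drive. It is not Google's code. It assumes per-sample pitch (`f0_hz`) and loudness curves have already been pulled out of a recording, and the fixed `harmonic_weights` list is a hypothetical stand-in for the timbre a trained instrument model would predict frame by frame.

```python
import math


def harmonic_synth(f0_hz, loudness, harmonic_weights, sample_rate=16000):
    """Additive synthesis: a sum of sine waves at integer multiples of f0.

    f0_hz and loudness are per-sample control signals of equal length.
    harmonic_weights sets the relative strength of harmonics 1..K
    (a hypothetical fixed timbre; a trained model would vary it over time).
    """
    if len(f0_hz) != len(loudness):
        raise ValueError("f0_hz and loudness must have the same length")
    total = sum(harmonic_weights)
    weights = [w / total for w in harmonic_weights]  # normalize so weights sum to 1
    nyquist = sample_rate / 2
    phase = 0.0  # running phase of the fundamental, in radians
    out = []
    for f0, amp in zip(f0_hz, loudness):
        # Accumulate phase sample by sample so pitch glides stay click-free;
        # wrapping is safe because sin(k * phase) has period 2*pi for integer k.
        phase = (phase + 2 * math.pi * f0 / sample_rate) % (2 * math.pi)
        sample = 0.0
        for k, w in enumerate(weights, start=1):
            if k * f0 < nyquist:  # skip harmonics that would alias
                sample += w * math.sin(k * phase)
        out.append(amp * sample)
    return out


if __name__ == "__main__":
    # A "sung" pitch glide from 220 Hz to 440 Hz over 0.1 s at constant loudness.
    sr = 16000
    n = 1600
    f0 = [220 + 220 * i / n for i in range(n)]
    audio = harmonic_synth(f0, [0.5] * n, [1.0, 0.5, 0.25], sample_rate=sr)
    print(len(audio), max(abs(x) for x in audio))
```

Swapping in a different `harmonic_weights` list changes the character of the tone while the pitch and loudness curves stay the same, which is the basic intuition behind reusing one recording across several instruments.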

People dig it — The verdict is in, and it looks like Tone/Transfer is impressing (and weirding out) its audience, especially studio artists.