When James Clarke went to work at London’s legendary Abbey Road Studios in late 2009, he wasn’t an audio engineer. He’d been hired to work as a software programmer. One day not long after he started, he was having lunch with several studio veterans of the 1960s and ’70s, the pre-computer era of music recording when songs were captured on a single piece of tape. To make conversation, Clarke asked a seemingly innocent question: Could you take a tape from the days before multitrack recording and isolate the individual instruments? Could you pull it apart?
The engineers shot him down. It turned into “several hours of the ins and outs of why it’s not possible,” Clarke remembers. You could perform a bit of sonic trickery to transform a song from one-channel mono to two-channel stereo, but that didn’t interest him. Clarke was seeking something more exacting: a way to pick apart a song so a listener could hear just one element at a time. Maybe just the guitar, maybe the drums, maybe the singer.
“I kept saying to them that if the human ear can do it, we can write software to do it as well,” he says. To him, this was a challenge. “I’m from New Zealand. We love proving people wrong.”
The challenge dropped him at the leading edge of a field known as upmixing, in which software and audio engineers work together to transform old recordings in ways that were once unthinkable. Using machine learning, engineers have made inroads into “demixing” the voices and instruments on recordings into completely separate component tracks, often known as stems. Isolating the components of songs is a surprisingly hard problem—more like unswirling paint than using a pair of scissors. But once engineers have stems, they can take the isolated tracks and “upmix” them into something new and perhaps improved. They might enhance a muffled drum track on an old recording, produce an a cappella version of a song, or do the opposite and remove a song’s vocals so it can be used as background in a TV show or movie.
As an Abbey Road employee, it was only natural that Clarke would soon focus his experimentation on Beatles songs. But he wasn’t the only one trying to pull apart old music. Around the world, other audio aficionados were tackling the same challenge with their own favorite tracks—and converging on some of the same methods. In the years since Clarke’s fateful lunchtime chat, the number of apps and tools for splitting songs has exploded, as has the community of academics and enthusiasts that surround the practice. For creators of sample-based music, demixing is conceivably the greatest sonic invention since the digital sampler that fueled the explosion of hip hop four decades ago. For karaoke fans, it’s a game changer. For the people (or private equity firms) who own the rights to classic but inferior recordings—or enthusiasts willing to wade into legal gray areas—upmixing presents a whole new way to hear the past. After decades of slow advancement, deep learning has now sent both technologies into overdrive. The uncanny valley is alive with the sound of music.
New Old Sounds
Two decades ago, one of the first people to experiment with demixing was Christopher Kissel, a professional electronics test engineer from Long Island. Kissel didn’t have easy access to a recording studio. But he was a lifelong music fan, and he dreamed of making old tracks sound new.
Until the 1960s, almost all popular music was recorded and listened to monaurally—all the instrumental and vocal parts were recorded onto a single track of tape and played back through a single speaker. Once a song was on tape, it was basically finished. But Kissel had an inkling that it might be possible to update old mono recordings in a profound way.
In 2000 he purchased his first Mac for the express purpose of transforming single-track pop songs from the ’50s and ’60s into two-channel stereo versions fit for headphones or properly separated speakers. “Compared to mono, stereo sounds more lifelike and allows you to more viscerally hear and appreciate the interplay between the musicians,” he says. Most listeners probably prefer their music on two speakers (or headphones), and Kissel cared enough to try forcing the old recordings into stereo.
The first night with his new Mac, Kissel used floppy disks to install an early digital audio workstation called sonicWORX. It was the only software capable of running Pandora Realtime, a plug-in that could selectively boost the volume of vocals on recordings. “It was very advanced for its time,” Kissel says. He wanted to see if the tool could do something more interesting. He loaded up Miss Toni Fisher’s 1959 hit “The Big Hurt” and attempted to pull it apart.
Tinkering with the song using sonicWORX’s waveform visualizer and settings, Kissel says, he was “able to separate the lead vocal, backing vocals, and strings and move them to the right side, and the rest of the backing instrumentation to the left.” It was crude and a little glitchy, but the effect was powerful. “It was quite thrilling to hear,” he says. Decades later, Kissel remains blown away by that first experience.
He experimented with more ’50s and ’60s classics, including the Del Vikings’ “Whispering Bells,” Johnny Otis’ “Willie and the Hand Jive,” and—perhaps most appropriately—the Tornados’ “Telstar,” a futuristic DIY wonder produced and composed by the influential sound engineer Joe Meek.
Kissel immersed himself in the developing fields of demixing and upmixing—though the names came later—by moderating forums and maintaining a website chronicling advances in the disciplines. He started playing around with a technique called spectral editing, which allowed people to treat sound as a visual object. Load a song into a spectral editor and you can see all of the recording’s many frequencies, represented as colorful peaks and valleys, laid out on a graph. At the time, audio engineers employed spectral editing to remove unwanted noise in a recording, but an intrepid user could also home in on specific frequencies of an audio track and pluck them out. When a freeware spectral editing tool called Frequency popped up, Kissel decided to try it out.
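The core idea of spectral editing can be sketched in a few lines: transform a waveform into a time-frequency grid, zero out the bins you want gone, and transform back. The snippet below is a minimal illustration, not the Frequency or sonicWORX workflow; the two sine tones are toy stand-ins for a vocal and a cymbal.

```python
import numpy as np
from scipy.signal import stft, istft

# Synthesize a toy "recording": a 440 Hz tone (a vocal stand-in)
# mixed with a 2,000 Hz tone (a cymbal stand-in).
sr = 8000
t = np.arange(sr * 2) / sr
mix = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)

# Short-time Fourier transform: the magnitude of Z is the spectrogram
# a spectral editor draws as peaks and valleys over time.
freqs, frames, Z = stft(mix, fs=sr, nperseg=512)

# "Spectral editing": zero every bin above 1,000 Hz, plucking out the
# high tone while leaving the low one untouched.
Z[freqs > 1000, :] = 0

# Invert back to a waveform containing only the 440 Hz component.
_, edited = istft(Z, fs=sr, nperseg=512)
```

A real editor does the same thing with a paintbrush-style interface, letting the user erase arbitrary shapes from the spectrogram instead of a whole frequency band.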
He spent about 60 hours using Frequency to craft an upmix of the 1951 mono R&B hit “The Glory of Love,” by the vocal group the Five Keys, using the app to carefully target the separate vocalists and spread their voices across the stereo spectrum. With a final polish by disco legend Tom Moulton, Kissel’s friend, it became one of the first spectrally edited upmixes to get released on a commercial album, in 2005. Several labels soon began releasing collections of upmixed mono-to-stereo hits, sometimes licensed, sometimes in the public domain, and sometimes in between.
French software company Audionamix started building professional demixing software to help users pull apart tracks on their own, which made this suite of techniques more accessible. In 2007 the company unveiled a major achievement in upmixing, bringing vintage Édith Piaf recordings from mono to theater-ready surround sound for the biopic La Vie en Rose. In 2009, the company opened a Hollywood office to continue courting film, television, and commercial work.
Other times, the company’s projects involved focused demixing. When the British online lending company Sunny wanted to use the song “Sunny” by late American R&B singer Bobby Hebb in a commercial, it found that one of the song’s original vocals interrupted the ad’s narration. With Audionamix’s help, the pesky vocals got zapped from existence. The French company also offers a service—originally called “music disassociation,” but now rebranded slightly less ominously as “music removal”—in which old television series and movies are scrubbed of music that might be too expensive to license, so they can be released in the latest format, be it DVD or streaming. According to Nicolas Cattaneo, a researcher at Audionamix, “This is the first thing that began to be really usable,” at least commercially. (Scholars studying music in old films and television shows should probably rely on releases from before 2009 or so if they want to make sure they’re hearing the original soundtracks.)
AudioSourceRE and Audionamix’s Xtrax Stems are among the first consumer-facing software options for automated demixing. Feed a song into Xtrax, for example, and the software spits out tracks for vocals, bass, drums, and “other,” that last term doing heavy lifting for the range of sounds heard in most music. Eventually, perhaps, a one-size-fits-all application will truly and instantly demix a recording in full; until then, it’s one track at a time, and it’s turning into an art form of its own.
What the Ear Can Hear
At Abbey Road, James Clarke began to chip away at his demixing project in earnest around 2010. In his research, he came across a paper written in the ’70s on a technique used to break video signals into component images, such as faces and backgrounds. The paper reminded him of his time as a master’s student in physics, working with spectrograms that show the changing frequencies of a signal over time.
Spectrograms could visualize signals, but the technique described in the paper—called non-negative matrix factorization—was a way of processing the information. If this new technique worked for video signals, it could work for audio signals too, Clarke thought. “I started looking at how instruments made up a spectrogram,” he says. “I could start to recognize, ‘That’s what a drum looks like, that looks like a vocal, that looks like a bass guitar.’” About a year later, he produced a piece of software that could do a convincing job of breaking apart audio by its frequencies. His first big breakthrough can be heard on the 2016 remaster of the Beatles’ Live at the Hollywood Bowl, the band’s sole official live album. The original LP, released in 1977, is hard to listen to because of the high-pitched shrieks of the crowd.
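Non-negative matrix factorization splits a magnitude spectrogram V into two smaller non-negative matrices, W and H, so that V ≈ WH: each column of W is a spectral template (what a drum or a bass “looks like”), and each row of H records when that template sounds. The sketch below uses the classic Lee–Seung multiplicative updates on a toy spectrogram; it illustrates the general technique, not Clarke’s actual software, and the “drum” and “bass” shapes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy magnitude spectrogram V (frequency bins x time frames): a mix of
# two "instruments," each the outer product of a spectral shape and an
# on/off activation pattern -- exactly the structure NMF recovers.
drum = np.outer([0.1, 0.2, 1.0, 0.8, 0.1], [1, 0, 1, 0, 1, 0, 1, 0])
bass = np.outer([1.0, 0.7, 0.1, 0.0, 0.0], [0, 1, 1, 0, 0, 1, 1, 0])
V = drum + bass + 1e-6

# Factor V ~= W @ H with everything non-negative.
k = 2  # number of sources we expect
W = rng.random((V.shape[0], k)) + 0.1
H = rng.random((k, V.shape[1])) + 0.1

# Lee-Seung multiplicative updates: each step keeps W and H
# non-negative and never increases the reconstruction error.
for _ in range(500):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

error = np.linalg.norm(V - W @ H)
```

Masking the original spectrogram with one recovered component and inverting it back to audio yields an (approximate) isolated stem—the step where the drum, vocal, and bass shapes Clarke learned to recognize become separate tracks.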
After unsuccessfully trying to reduce the noise of the crowd, Clarke finally had a “serendipity moment.” Rather than treating the howling fans as noise in the signal that needed to be scrubbed out, he decided to model the fans as another instrument in the mix. By identifying the crowd as its own individual voice, Clarke was able to tame the Beatlemaniacs, isolating them and moving them to the background. That, then, moved the four musicians to the sonic foreground.
Clarke became a go-to industry expert on upmixing. He helped rescue the 38-CD Grammy-nominated Woodstock–Back to the Garden: The Definitive 50th Anniversary Archive, which aimed to assemble every single performance from the 1969 mega-festival. (Disclosure: I contributed liner notes to the set.) At one point during some of the festival’s heaviest rain, sitar virtuoso Ravi Shankar took to the stage. The biggest problem with the recording of the performance wasn’t the rain, however, but that Shankar’s then-producer absconded with the multitrack tapes. After listening to them back in the studio, Shankar deemed them unusable and released a faked-in-the-studio At the Woodstock Festival LP instead, with not a note from Woodstock itself. The original festival multitracks disappeared long ago, leaving future reissue producers nothing but a damaged-sounding mono recording off the concert soundboard.
Using only this monaural recording, Clarke was able to separate the sitar master’s instrument from the rain, the sonic crud, and the tabla player sitting a few feet away. The result was “both completely authentic and accurate,” with bits of ambiance still in the mix, says the box set’s coproducer, Andy Zax.
“The possibilities upmixing gives us to reclaim the unreclaimable are really exciting,” Zax says. Some might see the technique as akin to colorizing classic black-and-white movies. “There’s always that tension. You want to be reconstructive, and you don’t really want to impose your will on it. So that's the challenge.”
Heading for the Deep End
Around the time Clarke finished working on the Beatles’ Hollywood Bowl project, he and other researchers were coming up against a wall. Their techniques could handle fairly simple patterns, but they couldn’t keep up with instruments with lots of vibrato—the subtle changes in pitch that characterize some instruments and the human voice. The engineers realized they needed a new approach. “That’s what led toward deep learning,” says Derry Fitzgerald, the founder and chief technology officer of AudioSourceRE, a music software company.
Fitzgerald was a lifelong Beach Boys fan; some of the mono-to-stereo upmixes he did of their work, for the fun of it, got tapped for official releases starting in 2012. Like Clarke, Fitzgerald had found his way to non-negative matrix factorization. And, like Clarke, he’d reached the limits of what he could do with it. “It got to a point where the amount of hours I spent tweaking the code was very, very time-consuming,” he says. “I thought there had to be a better way.”
The nearly parallel move to AI by Fitzgerald, James Clarke, and others echoed Clarke’s original instinct that if the human ear can naturally separate the sounds of instruments from one another, it should also be possible to model that same separation by machine. “I started researching deep learning to get more of a neural network approach to it,” Clarke says.
He started experimenting with a specific goal in mind: pulling out George Harrison’s guitar from the early Beatles hit “She Loves You.” On the original recording, the instruments and vocals were all laid on a single track, which makes it nearly impossible to manipulate.
Clarke started building an algorithm and trained it on every version of the song he could find—radio sessions, live versions, even renditions by tribute bands. “There were quite a few different ones, so plenty of examples to understand how the track should sound,” Clarke says. Using spectrograms, he now also knew how the track should look. The algorithm broke up the audio into individual stems, one for each instrument, but Clarke only had eyes and ears for Harrison’s Gretsch Chet Atkins Country Gentleman guitar.
Over nine months, Clarke sifted through the guitar part a few seconds at a time, virtually hand-cleaning the track phrase by phrase. He listened for stray audio artifacts from other instruments and used spectral editing software to find and eliminate them. For the final step, he set out to recapture the track’s original ambience. That part was easy. As an Abbey Road employee, he could book time in the vaunted Studio Two, where “She Loves You” was originally recorded. He played his track into the room through the in-house speakers and recorded it anew, to capture some of the subtleties of the room’s well-preserved acoustics. In August 2018, Clarke showed off his AI demixing work publicly for the first time.
The occasion was a sold-out lecture series that offered a rare chance for fans to step inside Studio Two, where the Beatles, Pink Floyd, and plenty of others recorded. Visitors were invited to re-create the clattering E-major chord that ends “A Day in the Life” by playing the studio’s pianos at the same time. The audience also received a glimpse of the future.
In front of a packed audience, Clarke played the Beatles’ original 1963 recording of “She Loves You.” Then, to pin-drop silence, he played what should have been impossible: the same recording with everything removed except for Harrison’s guitar.
Three days later, excerpts of Clarke’s demo made their way onto the web. The truthers quickly descended. Disbelieving audiophiles started trashing Clarke in online forums. “I think it’s a shame that the demonstration to show how good this new technology is happens to be false,” a user who went by Beatlebug wrote.
“It's kind of sad that Abbey Road has to mislead people like that,” RingoStarr39 posted in the same thread.
Beatlebug, RingoStarr39, and others insisted that the audio segment in Clarke’s lecture was an easier-to-isolate bit from a later German version of the song, “Sie Liebt Dich,” recorded in stereo. They declared James Clarke a charlatan.
But Clarke had merely demonstrated a proof of concept. Perfecting Harrison’s guitar track of “She Loves You” took him approximately 200 hours. He hadn’t even attempted to isolate John Lennon’s guitar. “Not a viable option for projects,” he admits. It was far from automated. But it could be done. And it would be.
Up, Up, and Away
The dam broke fully when French streaming service Deezer released an open source code library called Spleeter that allowed both casual and professional programmers to build tools for demixing and upmixing. Anybody comfortable enough with their computer’s command-line interface could download and install software from GitHub, select an audio file of their favorite song, and generate their own set of isolated stems. People started putting the code library to creative use. When tech blogger Andy Baio played around with it, he was delighted to discover how easy it now was to create mashups, such as when he crossed the Friends theme and Billy Joel’s “We Didn’t Start the Fire.” “Nobody should have this kind of power,” he tweeted.
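In practice, the command-line workflow is only a couple of steps. A sketch, assuming a working Python environment (the filename `song.mp3` and output directory are placeholders):

```shell
# Install Spleeter from PyPI; it downloads its pretrained
# models on first use.
pip install spleeter

# Split a song into four stems -- vocals, drums, bass, and "other" --
# written out as separate WAV files under output/song/.
spleeter separate -p spleeter:4stems -o output song.mp3
```

Swapping `4stems` for `2stems` or `5stems` selects a different pretrained model, trading speed for finer separation.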
The first generation of users are demixing and upmixing in creative ways. Some musicians are removing one instrument from a song to create tracks they can practice along to or to generate source material for new music. Podcast producers are cleaning up dialog recorded in noisy environments. Hobbyists are using iPad apps and free sites to create their own mixes or make any song karaoke-ready. Several streaming services in Japan now offer vocal removal in officially licensed form, including Spotify’s SingAlong, where listeners can turn down a song’s vocals, and Line Music, which promises real-time source separation.
Alongside established players such as Audionamix and James Clarke, the newest company offering professional demixing services is the California-based startup Audioshake.
The company will soon launch a service where music rights holders—both musicians and labels—can upload their tracks to the cloud and, within minutes, download high-quality stems ready for licensing in film, broadcasting, video games, and elsewhere. Audioshake claims best-in-field ratings for drums, bass, and vocals, according to benchmarks established by the Signal Separation Evaluation Campaign, an organization made up of audio researchers who track the progress of demixing techniques.
But Audioshake is also the first company to figure out how to automatically isolate guitars—or, more precisely, a single guitar. The company is tight-lipped about how it achieved this. “We refined the architecture of our deep-learning network to be specially tailored to the harmonics and timbre of the guitar,” says company AI researcher Fabian-Robert Stöter. Basically, when a user uploads a track to Audioshake, a layer in the company’s algorithm converts the song’s waveform into a numerical representation that makes it easier for the AI model to figure out where a guitar ends and everything else begins.
To see it work, I was invited to upload some songs. Within a few minutes, the company’s software was able to pull apart a track of a rock band playing in guitar-bass-drums-vocals power trio format. A track by Talking Heads’ original lineup came back with David Byrne’s 12-string acoustic guitar separated (with minimal artifacts) alongside tracks of Tina Weymouth’s bass and Chris Frantz’s drums. It works equally well on other songs in that exact guitar/bass/drums/vocals configuration. But music is huge, and the power trio format is a tight set of parameters.
Outside those parameters is the unclaimed frontier of demixing. The original recording of “She Loves You” comes back from Audioshake with Lennon’s and Harrison’s guitars sounding like jangling ghosts. James Clarke’s manual work still can’t be matched by a machine. That said, Audioshake does what couldn’t be done only a few years ago, pointing to a future in which machines will recognize more instruments. Some frontiers, though, might never be breached. For virtually all producers since the ’60s, a recording studio has been the place to combine unusual instruments and generate wondrous new sounds (and literal overtones) explicitly designed to blend together in the listener’s ears.
But what if the artifacts turn out to be art? If a demixing attempt gone awry sounds cool to the right producer, it might become the basis for fantastic new music. Think Cher turning Auto-Tune into a pop trend with “Believe.” As archival producer Andy Zax put it, “Some 16-year-old making hip hop records on a PlayStation is going to figure out some genius use of this thing and create a sound world we've never heard before.”
For now, plenty of experimentation is happening in far-flung fan forums, with unofficial upmixes of many equally unofficial recordings. Some fans have been exploring a subgenre that might be called upfakes, fusing, say, George Harrison’s original 1968 demo for “Sour Milk Sea” with the backing track from a more recent recording by another musician. (Fans are understandably jittery about copyright claims and generally only post their work with quickly expiring links.)
As for Clarke, he is still working on the exact AI methodology to pull apart a mono Beatles vocal track. He’s also started an independent company called Audio Research Group to work as demixer-for-hire. Lately he’s been helping to create a set of tracks for a band that lost all its master tapes and has only its LPs.
Even Clarke concedes, though, that many recordings can’t be pulled apart, especially if the instruments are close in frequency or a recording is particularly compressed, as on a radio broadcast or many audience-sourced live recordings. He once tried to demix a 1991 R.E.M. tape from London. “There’s just not enough from a spectral point of view, it’s so squashed,” Clarke says. “You get really fuzzy results.” For now, some blurry aspects of the past are going to stay blurry. But some are going to sound brighter than ever.