Samantha Schaevitz was in the home stretch of a fellowship at Huridocs, a human rights nonprofit, when she got the call. Schaevitz works on site reliability engineering at Google; site reliability engineers are the ones who keep the ship steady when things get choppy. And by February of this year, as large portions of Asia shut down in an attempt to slow the spread of the novel coronavirus, Google Meet found itself taking on water. They needed Schaevitz back at work.
Google launched Meet in 2017 as an enterprise-focused alternative to its Hangouts chat service. (Google has been steadily phasing out Hangouts and pushing users to Meet and Chat, part of its forever-muddled messaging platform strategy.) As the coronavirus spread and more countries issued stay-at-home orders, people flocked to video chat services for work and to check in on family and friends. Google saw Meet grow 30-fold in the early months of the pandemic; soon enough, the service was hosting up to 100 million meeting participants each day.
Amid all the profound changes people have made in response to Covid-19, the infrastructure that undergirds the internet experienced a shift in usage patterns, too, as people traded office hours for home isolation. The companies that handle these systems have mostly been able to manage users' new needs. “You essentially took the peak and extended it over a far longer period of the day,” says Ben Treynor Sloss, Google vice president of engineering. “The usage went way up, but it was mostly that the use looked more like peak most of the day, rather than that the peaks went up dramatically.” Some services, though, saw usage spike well beyond normal.
Google prepares for emergencies on a regular basis through its disaster and incident response tests, or DIRT. In these exercises, around 10,000 employees at a time will simulate handling some sort of crisis, ranging from a localized natural disaster to a Godzilla attack. The Covid-19 pandemic, though, turned out to exceed even the company’s most dramatic scenarios.
“We had typically simulated a regional-level event,” says Treynor Sloss. “We’d never done DIRT for a global-level event, in part, if I’m being honest, because it didn’t seem likely.” There was also a practical concern: Convincingly mocking up an incident with worldwide impact would risk degrading the experience of actual Google users, a cardinal sin in the world of DIRT.
All of which meant that Schaevitz, who led the incident response for Google Meet, and the teams involved had to figure things out on the fly, especially as it became clear that they were taking on far more new users than even their most ambitious early projections had anticipated.
“In the beginning, we started planning for a doubling of our footprint, which is already huge. That’s not the normal growth curve. We soon realized that wasn’t going to be enough,” says Schaevitz. “We kept trying to make progress on building more runway … so that we would have time to figure out a solution if things would arise on a longer time horizon rather than just every day waking up and being like, what’s newly on fire today?”
Complicating the challenge was that the Google engineers involved in the response were themselves working from home, spread across four offices in three countries. “All the people who worked on this—and this is a large number of teams—even the people working on it in the same place have actually never been in the room together since this started,” says Schaevitz, who is based in Zurich, Switzerland. On a technical level that proved manageable enough; as you might imagine, Google prioritizes web-based tools that can be accessed from anywhere. But coordinating the 24-hour-a-day operation remotely required setting up redundancies for more than just bandwidth. In a blog post detailing the response, Schaevitz described how everyone in an incident response role was assigned a “standby,” basically an understudy who could step in if the principal got sick or had to take time away. (An especially prudent measure during a global health crisis.)
At the same time Schaevitz and her team were frantically building out the runway, Google’s product teams were opening up the throttle. Despite Google’s built-in scale advantage, its upstart competitor Zoom had become an early favorite among the homebound masses, jumping from 10 million daily meeting participants in December to 300 million by April. Zoom relied mostly on Amazon Web Services to accommodate the growth. “Zoom was never designed to be a consumer-grade product,” says Alex Zukin, an analyst at RBC Capital. “But it was designed to be so easy to use, and so good to be on from a user experience perspective, that when the pandemic hit and everybody went home, they just started using Zoom for other things.”
Google understandably wanted people to use its tools, too. And so as millions of people flocked to video chat for that simulacrum of normalcy, Google promoted Meet in its already ubiquitous products like Gmail and Calendar. In March, it started offering free access to what had been paid tiers of Meet; features like meetings with up to 250 participants and the ability to record and save calls won't go back behind a paywall until September. In April, as usage continued to balloon, it added features like gallery view in an attempt to keep up with Zoom.
“We definitely were aware of these launches as they were being planned, and the concerns of capacity and how much growth we could take in which regions at which points in time was definitely considered,” says Schaevitz. One advantage of being Google is that capacity isn’t generally a problem. The company has 20 massive data centers scattered around the world. Because Meet is a global product, spikes in Europe can be smoothed out by borrowing some capacity in other regions’ off-hours, and so on.
Which doesn’t mean keeping Google Meet’s lights on was an easy ask. To make sure Meet sessions didn't just happen but went off smoothly, the teams needed to squeeze more capacity out of each server. They also had to adjust to changes not just in the volume of sessions but in their characteristics; sessions now tended to last longer, and involve more participants, than in pre-pandemic times.
At the same time, they built in “fire escapes” to prepare for unexpected surges. They engineered a way to quickly downgrade a new Meet participant’s stream from high- to low-definition, buying time to make efficiency and provisioning adjustments in a worst-case scenario. They also brought in automation experts from other parts of Google to help identify and implement ways to take some of the burden off the human responders.
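To make the “fire escape” idea concrete, here is a minimal sketch of how such a switch might look. It is an illustration only, not Google's implementation: the `RegionCapacity` type, the `pick_stream_quality` function, the capacity units, and the 90 percent threshold are all hypothetical. The sketch assumes that only participants who are newly joining are affected, as the description above suggests.

```python
from dataclasses import dataclass

HIGH_DEF = "high"
LOW_DEF = "low"


@dataclass
class RegionCapacity:
    """Hypothetical snapshot of one region's serving capacity (arbitrary units)."""
    total_units: float
    used_units: float

    @property
    def utilization(self) -> float:
        if self.total_units <= 0:
            raise ValueError("total_units must be positive")
        return self.used_units / self.total_units


def pick_stream_quality(region: RegionCapacity,
                        fire_escape_threshold: float = 0.9) -> str:
    """Choose the starting quality for a NEW participant's stream.

    Below the (assumed) threshold, new participants start in high definition.
    At or above it, they start in low definition, which buys time to add
    capacity without touching calls that are already in progress.
    """
    if region.utilization >= fire_escape_threshold:
        return LOW_DEF
    return HIGH_DEF


# Example: a calm region versus one in the middle of a surge.
calm = RegionCapacity(total_units=1000, used_units=600)
surge = RegionCapacity(total_units=1000, used_units=950)
print(pick_stream_quality(calm))   # high
print(pick_stream_quality(surge))  # low
```

The point of a switch like this is that it is cheap to flip and easy to reverse: it trades picture quality for headroom while the slower work of provisioning and tuning catches up.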
Even with all the resources of a company like Google, Schaevitz and her colleagues needed plenty of creativity and improvisation to handle the task at hand. Given the unpredictability of Covid-19, they’ll likely continue to need both, just with a much longer runway. “We’re in a place,” says Schaevitz, “where I think we have at least the foundations of the plans we need to be able to handle this going forward.”