
How an Obscure Company Took Down Big Chunks of the Internet

Early Tuesday morning, large portions of the web sputtered out for about an hour. The downed sites shared no obvious theme or geography; the outages were global, and they hit everything from Reddit to Spotify to The New York Times. (And yes, also WIRED.) In fact, the only thing they had in common was Fastly, a content-delivery network (CDN) provider whose predawn hiccup reverberated across the internet.

You may not have heard of Fastly, but you likely interact with it in some fashion every time you go online. Along with Cloudflare and Akamai, it’s one of the biggest CDN providers in the world. And while Fastly resolved Tuesday’s worldwide disruptions with relative speed, the incident offers a stark reminder of how fragile and interconnected internet infrastructure can be, especially when so much of it hinges on a handful of companies that operate largely outside of public awareness.

Special Delivery

To understand how a Fastly problem can quickly become everyone’s problem, it’s worth spending a minute on the role CDNs play in the internet ecosystem. While it’s tempting to think of the internet as amorphous—they even call it “the cloud”—the articles you read, the movies and songs you stream, the photos you post, they all live on physical servers. And while that content might be primarily hosted on a cloud provider, you still need a way to get it to people quickly and efficiently. 

That’s where a CDN comes in. By operating servers around the globe, CDNs can whittle down the distance between your smartphone and the internet experience of your choice. Think of it as the internet’s equivalent of a relay man in baseball: Rather than try to throw the ball to home plate on their own, an outfielder will instead toss it to an infielder, who in turn fires it to the catcher. It’s faster and more efficient.

“It basically enables really high performance for content, whether that’s streaming video or a site or all the little images that pop up when you go to an ecommerce site,” says Angelique Medina, director of product marketing at the network monitoring firm Cisco ThousandEyes. “Serving it really close to the user takes away a lot of the load time, and it enables everyone to have a really great experience when they’re surfing the web.”

Take this article that you’re reading right now. Chances are you’re reading a copy of it held in the cache of what's known as a “point of presence,” a server somewhere in your region. A Fastly network map indicates that the company operates POPs in at least 58 cities around the world, including multiples in densely populated areas like Los Angeles, London, and Singapore. It lists their combined global capacity at a whopping 130 terabits per second.

And that’s not all! CDNs don’t just store content closer to the devices that crave it. They also help direct it across the internet. “It is like orchestrating traffic flow on a massive road system,” says Ramesh Sitaraman, a computer scientist at the University of Massachusetts at Amherst who helped create the first major CDN as a principal architect at Akamai. “If some link on the internet fails or gets congested, CDN algorithms quickly find an alternate route to the destination.”

So you can start to see how a CDN going down can take heaping portions of the internet along with it. That alone doesn’t quite explain why Tuesday’s impacts were so far-reaching, though, especially when there are so many redundancies built into these systems. Or at least, there should be.

CDNs Consolidated

For the better part of Tuesday, it was unclear exactly what had transpired at Fastly. “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration,” a company spokesperson said in a statement that morning. “Our global network is coming back online.”

Late Tuesday, the company offered more specifics in a blog post detailing the incident. The root cause actually dates back to May 12, when the company inadvertently introduced a bug as part of a broad software deployment. Like a rune that only unlocks its evil powers under a certain incantation, the bug was harmless until and unless a Fastly client configured their setup in a specific way. Which, nearly a month later, one of them did.

The global disruption kicked off at 5:47am ET; Fastly spotted it within a minute. It took a bit longer, until 6:27am ET, to identify the configuration that triggered the bug that caused the failure. By this point, 85 percent of Fastly’s network was returning errors; every continent other than Antarctica felt the impact. Affected services started coming back at 6:36am ET, and everything was mostly back to normal by the top of the hour.

Even after Fastly had fixed the underlying issue, it cautioned that users could still see a lower “cache hit ratio” (how often you can find the content you’re looking for already stored in a nearby server) and “increased origin load,” which refers to the extra traffic sent back to the source for items not in the cache. In other words, the cupboards were still fairly bare. And it wasn’t until they were replenished globally that Fastly tackled the underlying bug itself. The company finally pushed a “permanent fix” several hours later, around lunchtime on the East Coast.
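
To make those two terms concrete, here is a minimal sketch of a single cache server sitting in front of an origin. It is a toy model, not Fastly's software: the `EdgeCache` class, the `origin` function, and the sample URLs are all hypothetical, and the `flush` call simply assumes the worst case in which a recovering server starts with nothing stored.

```python
class EdgeCache:
    """Toy model of one CDN cache server in front of an origin (hypothetical)."""

    def __init__(self):
        self.store = {}   # url -> cached content
        self.hits = 0     # requests answered from the cache
        self.misses = 0   # requests forwarded to the origin ("origin load")

    def get(self, url, fetch_from_origin):
        if url in self.store:
            self.hits += 1
            return self.store[url]
        # Not cached: go back to the source, then keep a copy for next time.
        self.misses += 1
        body = fetch_from_origin(url)
        self.store[url] = body
        return body

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def flush(self):
        # Assumption for this sketch: the cache comes back empty.
        self.store.clear()


def origin(url):
    """Stand-in for the site's own servers."""
    return f"content for {url}"


def replay(cache, urls):
    """Reset the counters, serve the URLs, return (hit ratio, origin requests)."""
    cache.hits = cache.misses = 0
    for url in urls:
        cache.get(url, origin)
    return cache.hit_ratio(), cache.misses


requests = ["/home", "/story", "/home", "/home", "/story"]
cache = EdgeCache()
replay(cache, requests)          # first pass fills the cache
warm = replay(cache, requests)   # everything is already stored nearby
cache.flush()
cold = replay(cache, requests)   # the cupboards are bare again
print("warm:", warm)
print("cold:", cold)
```

With a full cache, every request in this sample is a hit and the origin sees no traffic. After the flush, the first request for each page has to go back to the source, so the hit ratio drops and origin load rises until the cache refills.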

That an outage occurred is surprising, given that CDNs are typically designed to weather these tempests. “In principle, there is massive redundancy,” says Sitaraman, speaking about CDNs generally. “If a server fails, other servers could take over the load. If an entire data center fails, the load can be moved to other data centers. If things worked perfectly, you could have many network outages, data center problems, and server failures; the CDN’s resiliency mechanisms would ensure that the users never see the degradation.”

When things do go wrong, Sitaraman says, it typically relates to a software bug or configuration error that gets pushed to multiple servers at once.

Even then, the sites and services that employ CDNs typically have their own redundancies in place. Or at least, they should. In fact, you could see hints of how diversified various services are in the speed of their response Tuesday morning, says Medina. It took Amazon about 20 minutes to get back up and running, because it could divert traffic to other CDN providers. Anyone who relied solely on Fastly, or who didn’t have automated systems in place to accommodate the disruption, had to wait it out.
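
The difference between those two groups can be sketched in a few lines. This is an illustration of the general idea of automated failover, not a description of how Amazon or anyone else actually does it: the `pick_cdn` function, the provider names, and the hard-coded health status are all hypothetical stand-ins for what would be real health checks and traffic steering in production.

```python
def pick_cdn(providers, is_healthy):
    """Return the first provider that passes its health check, or None."""
    for name in providers:
        if is_healthy(name):
            return name
    return None


# Hypothetical snapshot of provider health during an outage.
status = {"cdn-a": False, "cdn-b": True}


def check(name):
    # Stand-in for a real probe, such as timing a request to a test URL.
    return status[name]


multi = pick_cdn(["cdn-a", "cdn-b"], check)   # falls through to the backup
single = pick_cdn(["cdn-a"], check)           # nowhere else to go
print("multi-CDN site serves from:", multi)
print("single-CDN site serves from:", single)
```

A site with a second provider configured keeps serving from the backup as soon as the check fails. A site with only one gets `None` back, which is the code equivalent of waiting it out.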

“The outage was the result of monoculture,” says Roland Dobbins, principal engineer of security firm Netscout. He suggests that every organization with a substantial online presence should have multiple CDN providers to avoid precisely this sort of situation.

Their options, though, are increasingly limited. Just as the cloud has largely been subsumed by Amazon, Google, and Microsoft, three CDN providers—Cloudflare, Akamai, and Fastly—dominate the flow of content online. “There’s a lot of concentration of usage within very few service providers,” Medina says. “Whenever any one of those three providers has an issue, typically it’s not something that lasts a very long time, but it has a major impact across the internet.”

That’s a big part, Medina says, of why these sorts of outages have been more frequent of late, and why they’ll only continue to get worse. Baseball needs a cutoff man; intersections need traffic cops. The fewer of those there are to rely on, the more connections get missed, and the bigger the crashes.

Additional reporting by Lily Hay Newman.

This story has been updated to include additional details from Fastly about the cause of Tuesday's outage.
