And that’s not all! CDNs don’t just store content closer to the devices that crave it. They also help direct it across the internet. “It is like orchestrating traffic flow on a massive road system,” says Ramesh Sitaraman, a computer scientist at the University of Massachusetts at Amherst who helped create the first major CDN as a principal architect at Akamai. “If some link on the internet fails or gets congested, CDN algorithms quickly find an alternate route to the destination.”
So you can start to see how when a CDN goes down, it can take heaping portions of the internet along with it. That alone, though, doesn’t quite explain how the impacts on Tuesday were so far-reaching, especially when there are so many redundancies built into these systems. Or at least, there should be.
For the better part of Tuesday, it was unclear exactly what had transpired at Fastly. “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration,” a company spokesperson said in a statement that morning. “Our global network is coming back online.”
Late Tuesday, the company offered more specifics in a blog post detailing the incident. The root cause actually dates back to May 12, when the company inadvertently introduced a bug as part of a broad software deployment. Like a rune that only unlocks its evil powers under a certain incantation, the bug was harmless until and unless a Fastly client configured their setup in a specific way. Which, nearly a month later, one of them did.
The global disruption kicked off at 5:47am ET; Fastly spotted it within a minute. It took a bit longer, until 6:27am ET, to identify the configuration that triggered the bug that caused the failure. By this point, 85 percent of Fastly’s network was returning errors; every continent other than Antarctica felt the impact. Services started coming back at 6:36am ET, and everything was mostly back to normal by 7am ET.
Even after Fastly had fixed the underlying issue, it cautioned that users could still see a lower “cache hit ratio”—how often you can find the content you’re looking for already stored in a nearby server—and “increased origin load,” which refers to the process of going back to the source for items not in the cache. In other words, the cupboards were still fairly bare. And it wasn’t until they were replenished globally that Fastly tackled the underlying bug itself. They finally pushed a “permanent fix” several hours later, around lunch time on the East Coast.
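To make those two terms concrete, here is a minimal sketch of a single cache warming back up after being emptied. It is a toy model, not Fastly’s software: the `EdgeCache` class, the `fetch_from_origin` helper, and the URLs are all hypothetical names invented for illustration.

```python
from collections import OrderedDict


def fetch_from_origin(url):
    """Hypothetical stand-in for a slow trip back to the publisher's own server."""
    return f"content of {url}"


class EdgeCache:
    """Toy least-recently-used cache standing in for one CDN server (an assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0  # every miss is one extra request landing on the origin

    def get(self, url):
        if url in self.store:
            self.store.move_to_end(url)  # mark as recently used
            self.hits += 1
            return self.store[url]
        self.misses += 1
        body = fetch_from_origin(url)
        self.store[url] = body
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used item
        return body

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


popular = ["/home", "/logo.png", "/app.js"]
cache = EdgeCache(capacity=10)

for url in popular:  # first pass: the cupboards are bare
    cache.get(url)
cold_ratio = cache.hit_ratio()

for url in popular:  # second pass: everything is already stocked
    cache.get(url)
warm_ratio = cache.hit_ratio()

print(cold_ratio, warm_ratio, cache.misses)
```

In this sketch every request in the first pass misses and goes to the origin, which is the “increased origin load”; the ratio only climbs once the same items are requested again.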
That an outage occurred is surprising, given that CDNs are typically designed to weather these tempests. “In principle, there is massive redundancy,” says Sitaraman, speaking about CDNs generally. “If a server fails, other servers could take over the load. If an entire data center fails, the load can be moved to other data centers. If things worked perfectly, you could have many network outages, data center problems, and server failures; the CDN’s resiliency mechanisms would ensure that the users never see the degradation.”
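The failover idea Sitaraman describes can be sketched in a few lines. This is an illustration under stated assumptions, not how any real CDN routes traffic: the `pick_server` function, the server names, and the health table are hypothetical.

```python
def pick_server(preferred_order, healthy):
    """Return the first healthy server in preference order, or None if all are down.

    Hypothetical helper: real CDNs weigh load, distance, and much more.
    """
    for name in preferred_order:
        if healthy.get(name, False):
            return name
    return None


servers = ["pop-a", "pop-b", "pop-c"]  # made-up names, nearest first

# One server fails on its own: traffic quietly shifts to the next one.
one_down = {"pop-a": False, "pop-b": True, "pop-c": True}
print(pick_server(servers, one_down))

# Every server fails at once: there is nowhere left to shift the load.
all_down = {"pop-a": False, "pop-b": False, "pop-c": False}
print(pick_server(servers, all_down))
```

The second case is the weak spot. Redundancy of this kind covers failures that strike servers independently; it offers little protection when the same fault hits most of the network at the same moment, as Tuesday’s bug did.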