And that’s not all. CDNs don’t just store content closer to the devices that need it. They also help route it across the internet. “It’s like orchestrating traffic on a vast road network,” says Ramesh Sitaraman, a computer scientist at the University of Massachusetts at Amherst who helped create the first major CDN as a senior architect at Akamai. “If a link on the internet fails or is congested, the CDN algorithms quickly find an alternate route to the destination.”
So you can begin to see how, when a CDN goes down, it can take large portions of the internet with it. That still doesn’t quite explain how far-reaching Tuesday’s impact was, though, especially when so much redundancy is built into these systems. Or at least there should be.
Again, it’s unclear exactly what happened to Fastly. “We have identified a service configuration that has triggered disruption to our POPs around the world and have disabled that configuration,” a company spokesperson said in a statement. “Our global network is coming back online.”
“Service configuration” can mean any number of things; the only certainty is that whatever the root cause, it had far-reaching effects. According to Fastly’s incident report page, every continent except Antarctica felt the impact. Even after Fastly fixed the underlying issue, it warned that users could still see a lower “cache hit rate” (how often the content you’re looking for is already stored on a nearby server) and “increased origin load,” which refers to the process of going back to the source, or origin, server for items that are not in the cache. In other words, the cupboards are still fairly bare.
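To make those two terms concrete, here is a minimal Python sketch, not anything drawn from Fastly’s systems. It assumes a toy edge server with a small least-recently-used cache; the names EdgeCache, fetch_from_origin, and replay, along with the request paths, are all hypothetical. It replays the same traffic twice, once against an empty cache and once against a populated one.

```python
from collections import OrderedDict


def fetch_from_origin(url):
    # Hypothetical stand-in for a real request to the customer's origin server.
    return f"content of {url}"


class EdgeCache:
    """Toy model of one CDN edge server with a small LRU cache (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.origin_fetches = 0

    def get(self, url):
        if url in self.store:
            # Cache hit: serve the stored copy and mark it as recently used.
            self.store.move_to_end(url)
            self.hits += 1
            return self.store[url]
        # Cache miss: go back to the origin, then keep a copy for next time.
        self.origin_fetches += 1
        body = fetch_from_origin(url)
        self.store[url] = body
        if len(self.store) > self.capacity:
            # Evict the least recently used item once the cache is full.
            self.store.popitem(last=False)
        return body


def replay(cache, requests):
    """Serve the requests; return (cache hit rate, origin fetches) for this batch."""
    cache.hits = 0
    cache.origin_fetches = 0
    for url in requests:
        cache.get(url)
    total = cache.hits + cache.origin_fetches
    hit_rate = cache.hits / total if total else 0.0
    return hit_rate, cache.origin_fetches


requests = ["/home", "/logo.png", "/home", "/logo.png", "/home", "/article"]
cache = EdgeCache(capacity=10)
cold = replay(cache, requests)  # cache starts empty, as after an outage
warm = replay(cache, requests)  # same traffic once the cache is populated
print(cold, warm)
```

In this toy run, the empty cache answers only half the requests itself and sends three back to the origin, while the populated cache answers all of them. That gap is the bare cupboard Fastly was warning about.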
That the outage happened at all is surprising, given that CDNs are typically designed to weather these storms. “In principle, there is massive redundancy,” says Sitaraman, speaking of CDNs in general. “If one server goes down, other servers could take over. If an entire data center goes down, the load can be shifted to other data centers. If things worked perfectly, you could have a lot of network outages, data center issues, and server failures; the CDN’s resiliency mechanisms would ensure that users never see degradation.”
When things do go wrong, Sitaraman says, it’s usually because of a software bug or a configuration error that gets pushed to multiple servers at once.
Even so, sites and services that use CDNs usually have their own redundancies in place. Or at least they should. In fact, Medina says, you can see hints of how diversified various services are in how quickly they responded this morning. It took Amazon about 20 minutes to get back up and running, because it could divert traffic to other CDN providers. Anyone who relied solely on Fastly, or who didn’t have automated systems in place to deal with the disruption, had to wait.
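What an automated system of that kind might look like, in its simplest form, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Amazon or anyone else actually does it: the function pick_cdn, the provider labels "cdn-a" and "cdn-b", and the simulated health table are all hypothetical, and a real deployment would probe each provider over the network and switch traffic through DNS or a load balancer.

```python
from typing import Callable, Optional, Sequence


def pick_cdn(providers: Sequence[str],
             is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first provider, in priority order, that passes its health check."""
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A health check that errors out counts as an unhealthy provider.
            continue
    # Every provider failed; the caller has nowhere left to send traffic.
    return None


# Simulated health status (hypothetical): the primary is down, the backup is up.
status = {"cdn-a": False, "cdn-b": True}

multi_cdn_choice = pick_cdn(["cdn-a", "cdn-b"], lambda name: status[name])
single_cdn_choice = pick_cdn(["cdn-a"], lambda name: status[name])
print(multi_cdn_choice, single_cdn_choice)
```

With two providers configured, the sketch falls through to the backup as soon as the primary fails its check. With only one, it returns nothing at all, which is roughly the position of anyone who relied solely on Fastly.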
“The failure was the result of monoculture,” explains Roland Dobbins, senior engineer at security company Netscout. He suggests that every organization with a substantial online presence should have multiple CDN providers to avoid precisely this kind of situation.
Their options, however, are increasingly limited. Just as the cloud is largely dominated by Amazon, Google, and Microsoft, three CDN providers (Cloudflare, Akamai, and Fastly) dominate the flow of online content. “There is a lot of concentration of use within very few service providers,” says Medina. “Anytime one of these three providers has a problem, it’s usually not something that lasts very long, but it has a major impact on the internet.”
That concentration, according to Medina, is largely why these types of outages have become more common lately, and why they will only get worse. Baseball needs a cutoff man; intersections need traffic cops. The fewer there are to rely on, the more connections get missed and the more serious the pileups.
Additional reporting by Lily Hay Newman.