
The Fastly Outage — A Masterclass in Single Points of Failure

Osmond van Hemert
Cloud Operations - This article is part of a series.

Two days ago, large portions of the internet went dark. Amazon, Reddit, the New York Times, the UK government’s gov.uk, Twitch, Stack Overflow — all unreachable. The culprit wasn’t a sophisticated cyberattack or a catastrophic hardware failure. It was a single valid configuration change deployed to Fastly’s CDN network that triggered an undiscovered software bug. The outage lasted roughly an hour, but the shockwaves will be felt far longer.

I’ve been working in this industry long enough to remember when we thought putting everything behind a CDN was the ultimate reliability play. And in most cases, it still is. But Tuesday’s incident is a stark reminder that abstracting complexity doesn’t eliminate it — it concentrates it.

What Actually Happened

Fastly has been refreshingly transparent about the incident. A customer pushed a valid configuration change that triggered a bug in their software, which caused 85% of their network to return errors. Their engineering team identified the issue within minutes, and most services were restored within 49 minutes.

Let’s be clear: going from global disruption to a largely restored network in 49 minutes is genuinely impressive. Many organizations take longer to acknowledge an incident exists, let alone resolve one affecting global infrastructure. Fastly’s engineering team deserves credit for that.

But the root cause — a latent bug activated by a routine configuration change — is the kind of failure mode that should make every infrastructure engineer uncomfortable. This wasn’t an edge case or an unreasonable input. It was a normal customer action that happened to tickle a code path nobody had tested thoroughly enough.

The Concentration Problem

Here’s the uncomfortable architectural reality: we’ve traded distributed fragility for concentrated fragility. The old internet was unreliable in lots of small ways — individual servers went down, individual sites had issues. The modern internet is reliable almost all the time, but when it fails, it fails spectacularly because so much traffic flows through so few providers.

Fastly, Cloudflare, and Akamai collectively handle an enormous percentage of web traffic. AWS, Azure, and GCP underpin most of the services behind those CDNs. The dependency graph of the modern internet looks less like a resilient mesh and more like an hourglass — millions of clients on one side, millions of origin servers on the other, and a surprisingly small number of infrastructure providers in the middle.

I’ve had this conversation with clients dozens of times: “What’s your disaster recovery plan if your CDN goes down?” The typical answer is a blank stare. We’ve internalized the assumption that these services are essentially utilities — always available, like electricity. But even electricity grids have redundancy built in at multiple levels. Most web architectures have a single CDN provider and no fallback.

What This Means for Architecture Decisions

If you’re running anything where availability matters — and let’s be honest, that’s most things — this outage should prompt some concrete architectural questions:

Multi-CDN strategies: Running traffic through multiple CDN providers with DNS-level failover is technically feasible but operationally complex. You’re managing configuration, cache invalidation, and SSL certificates across multiple platforms. The cost is real, but for critical services, it’s worth evaluating. Companies like Citrix (NetScaler) and NS1 offer intelligent DNS routing that can detect CDN failures and redirect traffic.
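
What the DNS-side failover logic can look like is sketched below. It is a minimal illustration, not any particular provider’s API: updateCname() is a hypothetical stand-in for your DNS provider’s update call, and the probe URL, failure threshold, and interval are assumptions.

```typescript
// failover.ts - sketch of DNS-level CDN failover (provider-agnostic).
// updateCname() is a hypothetical stand-in for your DNS provider's API;
// PROBE_URL, FAILURE_THRESHOLD, and the 30s interval are assumptions.
const PROBE_URL = "https://www.example.com/healthz"; // served via the primary CDN
const FAILURE_THRESHOLD = 3;

async function probeOnce(timeoutMs = 5000): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(PROBE_URL, { signal: controller.signal });
    return res.ok;
  } catch {
    return false; // timeout, network error, or CDN-level failure
  } finally {
    clearTimeout(timer);
  }
}

// Hypothetical: repoint the public hostname at the secondary CDN.
async function updateCname(target: string): Promise<void> {
  console.log(`would update CNAME for www.example.com -> ${target}`);
}

async function watch(): Promise<void> {
  let failures = 0;
  for (;;) {
    failures = (await probeOnce()) ? 0 : failures + 1;
    if (failures >= FAILURE_THRESHOLD) {
      await updateCname("www.secondary-cdn.example.net");
      failures = 0;
    }
    await new Promise((r) => setTimeout(r, 30_000)); // probe every 30 seconds
  }
}

watch();
```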

Origin resilience: Can your origin servers handle direct traffic if the CDN layer disappears? Many applications have scaled their origin infrastructure based on the assumption that the CDN absorbs 90%+ of the traffic. If that CDN layer vanishes, the origin gets crushed. Load testing without the CDN in the path is an exercise few teams perform but many should.
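
A crude way to get a first signal is to point a load generator at the origin hostname directly, bypassing the CDN entirely. The sketch below does exactly that; the origin URL, concurrency, and duration are assumptions, and a real exercise belongs in a purpose-built tool against a staging environment.

```typescript
// origin-load.ts - rough sketch: load the origin directly, CDN out of the path.
// ORIGIN_URL, CONCURRENCY, and DURATION_MS are assumptions.
const ORIGIN_URL = "https://origin.example.com/";
const CONCURRENCY = 50;
const DURATION_MS = 60_000;

async function worker(deadline: number, stats: { ok: number; failed: number }) {
  while (Date.now() < deadline) {
    try {
      const res = await fetch(ORIGIN_URL, { headers: { "cache-control": "no-cache" } });
      if (res.ok) stats.ok++;
      else stats.failed++;
    } catch {
      stats.failed++;
    }
  }
}

async function main() {
  const stats = { ok: 0, failed: 0 };
  const deadline = Date.now() + DURATION_MS;
  await Promise.all(Array.from({ length: CONCURRENCY }, () => worker(deadline, stats)));
  const total = stats.ok + stats.failed;
  console.log(`${total} requests, ${stats.failed} failed (${((stats.failed / total) * 100).toFixed(1)}% error rate)`);
}

main();
```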

Graceful degradation: Could your application serve a reduced experience directly? Static HTML fallback pages, service workers with cached content, progressive enhancement that doesn’t depend on a CDN for core functionality — these are patterns that existed before CDNs became ubiquitous, and they’re still valuable.
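
As a concrete example of the service-worker pattern, here is a minimal sketch that pre-caches a static fallback page and serves it when a navigation request fails. The file name, cache name, and the assumption that the script is compiled with TypeScript’s webworker lib are mine, not anything specific to this incident.

```typescript
/// <reference lib="webworker" />
// sw.ts - minimal service-worker sketch: pre-cache a static fallback page from
// the origin and serve it when a navigation request fails. /fallback.html and
// the cache name are assumptions.
declare const self: ServiceWorkerGlobalScope;

const FALLBACK_CACHE = "degraded-v1";

self.addEventListener("install", (event) => {
  // Pre-cache the reduced experience while everything is healthy.
  event.waitUntil(
    caches.open(FALLBACK_CACHE).then((cache) => cache.addAll(["/fallback.html"]))
  );
});

self.addEventListener("fetch", (event) => {
  if (event.request.mode !== "navigate") return;
  event.respondWith(
    fetch(event.request).catch(async () => {
      // The CDN (or network) path is failing: degrade gracefully.
      const cache = await caches.open(FALLBACK_CACHE);
      return (await cache.match("/fallback.html")) ?? Response.error();
    })
  );
});
```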

Monitoring from the outside: If your monitoring infrastructure runs through the same CDN as your production traffic, you might not even know you’re down. External monitoring from diverse network paths isn’t optional for serious production systems.
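
A minimal outside-in check can be as simple as the sketch below, run from infrastructure that shares nothing with your CDN or cloud account. sendAlert() is a hypothetical hook into whatever paging system you use; the target URLs and timeout are assumptions.

```typescript
// external-probe.ts - outside-in check, meant to run on infrastructure that
// shares nothing with your CDN or cloud account. sendAlert() is a hypothetical
// hook into your paging system; TARGETS and the timeout are assumptions.
const TARGETS = ["https://www.example.com/", "https://www.example.com/healthz"];

async function sendAlert(message: string): Promise<void> {
  console.error(`ALERT: ${message}`); // stand-in for PagerDuty, Slack, etc.
}

async function check(url: string): Promise<void> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 10_000);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) await sendAlert(`${url} returned ${res.status}`);
  } catch (err) {
    await sendAlert(`${url} unreachable: ${err}`);
  } finally {
    clearTimeout(timer);
  }
}

Promise.all(TARGETS.map(check));
```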

The Software Bug Angle

Beyond the architectural conversation, there’s a software quality story here. Fastly’s edge platform is configured through Varnish Configuration Language (VCL), which customers can customize extensively. The interaction between customer configurations and the underlying platform software creates a vast state space that’s genuinely difficult to test comprehensively.

This is a pattern I see repeatedly in platform engineering: the more flexibility you give users, the harder it is to guarantee that every possible combination of valid inputs produces correct behavior. It’s the configuration-as-code challenge writ large. Every configuration option multiplies the testing surface. Every customer-facing knob is a potential trigger for an untested interaction.

Fastly will fix this specific bug, undoubtedly. But the class of bug — latent defects triggered by valid configuration changes — is essentially unsolvable through testing alone. It requires defense-in-depth: canary deployments for configuration changes, blast radius limitation, circuit breakers that prevent a single bad configuration from propagating globally.
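
To make that defense-in-depth idea concrete, here is a rough sketch of a staged configuration rollout with a circuit breaker. applyConfig(), checkErrorRate(), and rollback() are hypothetical stand-ins for platform-specific operations, and the stage sizes, soak time, and error threshold are assumptions.

```typescript
// staged-rollout.ts - sketch of blast-radius limitation for config changes.
// applyConfig(), checkErrorRate(), and rollback() are hypothetical stand-ins
// for platform-specific operations; stages, soak time, and threshold are assumptions.
const STAGES = [0.01, 0.05, 0.25, 1.0]; // fraction of edge nodes per stage
const SOAK_MS = 5 * 60_000;
const ERROR_RATE_LIMIT = 0.02;

async function applyConfig(configId: string, fraction: number): Promise<void> {
  console.log(`applying ${configId} to ${fraction * 100}% of nodes`);
}

async function checkErrorRate(fraction: number): Promise<number> {
  console.log(`checking error rate across the ${fraction * 100}% slice`);
  return 0.001; // stand-in: query your metrics system for the canaried slice
}

async function rollback(configId: string): Promise<void> {
  console.log(`rolling back ${configId} everywhere`);
}

async function deploy(configId: string): Promise<boolean> {
  for (const fraction of STAGES) {
    await applyConfig(configId, fraction);
    await new Promise((r) => setTimeout(r, SOAK_MS)); // soak before widening
    if ((await checkErrorRate(fraction)) > ERROR_RATE_LIMIT) {
      // Circuit breaker: a bad change never propagates past the current stage.
      await rollback(configId);
      return false;
    }
  }
  return true;
}

deploy("customer-config-1234");
```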

My Take

I don’t think this outage means you should abandon CDNs or start building everything on bare metal. The reliability gains from CDN infrastructure are real and significant. What it does mean is that we need to stop treating any single infrastructure provider as infallible.

The internet was designed to route around damage. Somewhere along the way, we built an application layer that routes everything through the same handful of chokepoints. That’s not a CDN problem — it’s an architecture problem. And it’s one we’ve collectively chosen because it’s cheaper and simpler than true redundancy.

For most of my clients, the pragmatic takeaway is this: understand your CDN dependency, have a documented (and tested) plan for when it fails, and make sure your monitoring can actually detect the failure. That won’t prevent the next outage, but it’ll make the difference between a 49-minute inconvenience and a 49-minute crisis.

The internet isn’t as resilient as we like to pretend. Tuesday made that impossible to ignore.
