Cloudflare Crashed. Half the Internet Went Down. Here’s What I Learned.
ChatGPT went offline. Spotify stopped working. Dropbox, Zoom, Reddit—all down.
The cause? A database permissions change.
Not a cyberattack. Not a hardware failure. A permissions change.
What Actually Happened
Cloudflare’s team made a sensible security improvement. They moved from a shared database account to individual user accounts. Better audit trails. Better security. Standard practice.
The new permissions gave users visibility into a second database called “r0.” The query that builds the Bot Management feature file didn’t filter by database name, so it started picking up duplicate rows and returning 200+ features instead of the expected 60.
The Bot Management module has a hard limit of 200 features. The oversized file tripped that limit, and the module crashed.
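The postmortem includes the real Rust code; the sketch below is not that code, just an illustration of the failure shape under my own assumptions: a hard cap whose error gets unwrapped, so a bad config file becomes a process crash instead of a fallback to the last known-good file. All names and counts here are made up.

```rust
// Illustration only, not Cloudflare's code: a loader with a hard cap on
// feature count, where the caller treats the error as unrecoverable.
const MAX_FEATURES: usize = 200;

fn load_features(names: Vec<String>) -> Result<Vec<String>, String> {
    // The cap itself is reasonable: it lets the module preallocate memory.
    if names.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds limit of {}",
            names.len(),
            MAX_FEATURES
        ));
    }
    Ok(names)
}

fn main() {
    // A config file with duplicated rows ends up well past the cap
    // (hypothetical count and feature names).
    let bloated: Vec<String> = (0..240).map(|i| format!("feature_{i}")).collect();

    // unwrap() turns a recoverable "bad config" error into a crash.
    // A safer caller would log the error and keep the last known-good file.
    let features = load_features(bloated).unwrap();
    println!("loaded {} features", features.len());
}
```

The interesting part isn’t the cap. It’s what the caller does when the cap is exceeded: crash, or degrade gracefully and keep serving.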
The crashes spread across edge servers. Because the config file was regenerated every few minutes and only some database nodes produced the bad version, servers kept flipping between good and bad files, recovering and then failing again. The intermittent recovery made the problem harder to isolate.
The investigation took 2.5 hours.
Why So Long?
An unrelated issue happened at the same time: Cloudflare’s status page, which is hosted entirely outside Cloudflare’s own infrastructure, went offline too.
The team thought they were under a DDoS attack. They spent precious time investigating the wrong problem.
This is the part that hit home for me.
Three Things I’ve Learned Managing Production Systems
I’ve run production systems across three startups. I’ve managed teams of 50+ engineers. I’ve seen my share of outages. Here’s what incidents like this have taught me:
Two problems at once confuse everyone.
The status page failure made Cloudflare’s team assume an attack. Smart engineers looked in the wrong place first.
This happens to the best teams. When you see two symptoms, your brain connects them. You build a story that explains both. The story feels right because it’s elegant. But elegance can be a trap.
At WisdmLabs, we now have a rule: when two things break at once, we split the team. One group investigates each problem independently. We compare notes only after we have separate hypotheses. It feels inefficient, but it prevents the most common diagnostic mistake: forcing two unrelated symptoms into a single story.
Postmortems teach more than success stories.
Cloudflare shared the exact code that failed and a detailed timeline. Matthew Prince, the CEO, wrote the first draft from Lisbon the same evening. The incident was still warm, and they were already documenting.
Most companies hide their failures. The transparent ones help everyone get better.
After every significant incident at WisdmLabs, we write a postmortem. We share it with the whole engineering team. The format is simple: what happened, why it happened, what we’re changing. No blame. Just learning.
The outages we’ve learned the most from were the embarrassing ones. The ones where a single line of code took down a service. The ones where we missed something obvious. Those postmortems get read and remembered.
Dependencies are expensive.
CDNs like Cloudflare sit in front of a huge share of the web’s traffic. When they go down, your site goes down with them. There’s no getting around this.
You can redirect traffic to your own servers. You can use a backup CDN. Both options cost money and time to maintain. Both add complexity. Both create new failure modes.
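To make that concrete, here’s a minimal sketch (placeholder hostnames, not a production pattern): a reachability check that decides whether asset URLs point at the CDN or fall back to your own origin. Even this toy version shows where the new cost and failure modes come from.

```rust
// Illustration only, with placeholder hostnames: a crude reachability check
// that decides whether assets are served from the CDN or from origin.
use std::net::{TcpStream, ToSocketAddrs};
use std::time::Duration;

fn reachable(host: &str, port: u16, timeout: Duration) -> bool {
    // A TCP connect only proves the host answers; a CDN returning errors
    // would still "pass". A real failover needs a deeper health check.
    match (host, port).to_socket_addrs() {
        Ok(mut addrs) => addrs
            .next()
            .map(|addr| TcpStream::connect_timeout(&addr, timeout).is_ok())
            .unwrap_or(false),
        Err(_) => false,
    }
}

fn asset_base_url() -> &'static str {
    if reachable("cdn.example.com", 443, Duration::from_millis(500)) {
        "https://cdn.example.com"
    } else {
        // The fallback only helps if it is provisioned, tested, and able to
        // absorb the traffic the CDN normally shields you from.
        "https://origin.example.com"
    }
}

fn main() {
    println!("serving assets from {}", asset_base_url());
}
```

Now you own the health check, the thresholds, and an origin that must survive full load. That’s the price of the hedge.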
I’ve dealt with this across my career. There’s no easy answer. The honest advice is this: know your dependencies, understand what happens when they fail, and decide explicitly which risks you’re accepting.
At WisdmLabs, we serve hundreds of clients. A CDN outage affects all of them. We’ve made explicit decisions about which dependencies we’ll accept and which we’ll hedge against. The decisions aren’t perfect. They’re intentional.
The Deeper Pattern
Every major outage I’ve studied follows the same pattern:
- A reasonable change gets made
- An unexpected interaction occurs
- The system fails in a way nobody predicted
- Recovery is slower than expected because of confusion
The Cloudflare incident is textbook. The permissions change was reasonable. The feature count interaction was unexpected. The status page failure added confusion. Recovery took 2.5 hours instead of 30 minutes.
You can’t prevent unexpected interactions. Systems are too complex. But you can build for fast recovery. You can train for confusion. You can practice incidents before they happen.
What I’d Add to Any Postmortem
Cloudflare’s postmortem is excellent. If I were writing it for my team, I’d add one section: “What we’d do differently if this happened again tomorrow.”
Not “what we’re changing to prevent this.” That’s standard. I mean: if the exact same incident happened tomorrow, with our current systems, how would we respond faster?
That question forces practical thinking. It separates “nice to have” improvements from “would have saved us an hour” improvements.
The full Cloudflare incident report is worth reading. They show the actual code that panicked and explain each step of their response.
What production incidents have taught you the most? I’m always looking for postmortems to learn from.