The holes that kill you are the ones you never tested

Every time something goes down, the first instinct in the room is to add a layer. Another replica. A second region. One more health check in front of the thing that broke. It feels like progress, and the Swiss cheese model hands that instinct a lovely picture to point at.

You know the diagram even if you've never heard the name. Stacked slices of cheese, each slice a layer of defence, every slice riddled with holes. An accident only makes it through when the holes in all the slices happen to line up. James Reason built it to explain how accidents slip through layered defences in high-hazard work, it became a fixture of aviation and healthcare safety, and it's still the best tool we have for explaining why a system with five safety nets can still face-plant.

The picture has a side effect though. It teaches you to count slices. And counting slices is mostly the wrong job.

Take redundancy, since that's where slice-counting does the most damage. The model says two of everything beats one. Two servers, two regions. That holds right up until both copies share a single reason to die at the same instant.

On 19 July 2024 CrowdStrike shipped a content update to its Falcon sensor and crashed around 8.5 million Windows machines, by Microsoft's estimate, inside a few hours. Every one of those hosts was "redundant" in somebody's architecture diagram. Made no difference. They ran the same agent and swallowed the same bad file at the same moment, so they all dropped together. When the failure is perfectly correlated like that, redundancy buys you nothing. NASA's reliability folks put it bluntly: if a failure in either unit also takes out the other, two subsystems fail twice as often as one.
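The arithmetic behind that is short enough to sketch. This is a toy model, not anyone's official reliability method: the function names and the probabilities are made up for illustration, and it assumes a single shared fault that hits every replica at once, on top of faults that hit each replica independently.

```python
def outage_probability(p_fail: float, replicas: int, p_common: float = 0.0) -> float:
    """Chance that every replica is down at the same time.

    p_fail:   chance one replica fails on its own (independent faults)
    p_common: chance of a shared fault that takes all replicas down together
    """
    all_fail_independently = p_fail ** replicas
    # Either the shared fault fires, or it doesn't and every replica fails anyway.
    return p_common + (1.0 - p_common) * all_fail_independently


def coupled_failure_probability(p_fail: float, units: int) -> float:
    """Chance of an outage when a fault in any one unit drags the rest down."""
    return 1.0 - (1.0 - p_fail) ** units


if __name__ == "__main__":
    p = 0.01         # hypothetical per-replica failure chance
    shared = 0.005   # hypothetical chance of a bad update hitting everything

    for n in (1, 2, 5):
        print(f"{n} replica(s), independent only: {outage_probability(p, n):.6f}")
        print(f"{n} replica(s), with shared fault: {outage_probability(p, n, shared):.6f}")

    print(f"two coupled units: {coupled_failure_probability(p, 2):.4f} vs one unit: {p:.4f}")
```

However many replicas you add, the result never drops below `p_common`. That floor is the skewer through every slice, and the coupled case shows where "twice as often" comes from: two chances to trip the same outage.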

[Image: a blueprint cross-section of stacked redundant defence layers, each riddled with holes, with a single skewer passing through a hole aligned in the exact same spot on every layer.]

There's a quieter version of the same trap hiding in your SLOs. String together a dozen services that are each up 99.9% of the time and your ceiling is about 98.8% before you've written a line of your own code, because every dependency is one more slice with its own holes. You can't promise more reliability than the flakiest thing you lean on, and each extra nine costs more than the last.
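If you want to check that ceiling yourself, here is a minimal sketch. It assumes every dependency sits on the critical path and fails independently of the others, which real systems rarely honour exactly. The helper names are mine, not from any SLO tooling.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Best-case availability when every dependency must be up at once."""
    return math.prod(availabilities)


def downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Expected minutes of downtime in the window at a given availability."""
    return (1.0 - availability) * window_days * 24 * 60


if __name__ == "__main__":
    single = 0.999
    chain = serial_availability([single] * 12)  # a dozen three-nines services

    print(f"one service:  {single:.3%}, ~{downtime_minutes(single):.0f} min down per 30 days")
    print(f"twelve deep:  {chain:.3%}, ~{downtime_minutes(chain):.0f} min down per 30 days")
```

Under those assumptions the monthly downtime allowance goes from roughly 43 minutes for one service to more than eight hours for the chain, and none of it is spent on your own bugs yet.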

The slices you can draw on a whiteboard are the safe ones. The holes that get you are the ones nobody can see: the failover that's never been run, the assumption that stopped being true six months ago when no one was looking, the dead code still sitting in production behind a flag.

Knight Capital is the one I'd tattoo on a junior engineer. On 1 August 2012 they switched on new trading code that had only made it onto seven of their eight servers. On the eighth, a reused flag woke a dead function called Power Peg that should have been ripped out years earlier. For about 45 minutes it hurled orders into the market with nothing counting the fills: more than four million executions and 397 million shares. Knight's own books put the loss at roughly $440 million. The firm needed an emergency rescue from outside investors within the week, and within a year it had been absorbed in a merger with Getco.

Pull it apart and every hole was survivable on its own: dead code left in the build, a flag doing double duty, a deploy one person ran with nobody reviewing it, 97 warning emails before the open that everyone read as noise. None of those sinks you alone. Line them up on a single Wednesday morning and the company is gone.

That's Reason's real point, and it's a good one. The model is right. It's just coarse. Richard Cook said it better in How Complex Systems Fail: complex systems run in a degraded state the whole time, stuffed with small faults, staying up because people quietly hold them together. Catastrophe needs a few of those faults to meet. A sixth slice does nothing about the holes already living in the five you've got.

Your architecture diagram and your runbook describe the system you imagine you have. The real system is whatever your on-call engineer does at 3am to keep it limping along. Erik Hollnagel calls that gap work-as-imagined versus work-as-done, and the holes love living in it.

The useful work is dragging those invisible holes into daylight while you still get to pick the timing. Adrian Cockcroft has the perfect name for skipping it. He calls an untested failover "availability theatre": you've got the runbook, you've got the standby, you feel great, and you have no clue whether any of it works because you've never once pulled the plug to watch.
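One cheap way to drag that into daylight is to track when each standby was last actually failed over to, and to treat "never" as the worst answer. The sketch below is only that, a sketch: the `Standby` record, the names in the example and the 90-day threshold are all hypothetical, and the dates would have to come from wherever you log your drills.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Standby:
    name: str
    last_failover_drill: Optional[date]  # None means it has never been exercised


def untested_standbys(standbys: List[Standby], today: date, max_age_days: int = 90) -> List[str]:
    """Names of standbys with no failover drill inside the allowed window."""
    cutoff = today - timedelta(days=max_age_days)
    return [
        s.name
        for s in standbys
        if s.last_failover_drill is None or s.last_failover_drill < cutoff
    ]


if __name__ == "__main__":
    fleet = [
        Standby("db-replica", date(2024, 1, 10)),  # drilled, but long ago
        Standby("region-b", None),                 # on the diagram, never tried
        Standby("cache-standby", date(2024, 5, 1)),
    ]
    for name in untested_standbys(fleet, today=date(2024, 6, 1)):
        print(f"availability theatre suspect: {name}")
```

A list like this proves nothing about whether the failover works. All it does is stop an unexercised standby from counting as a slice until somebody has pulled the plug on it.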

[Image: a blueprint cross-section of the same holey layers slid apart and probed by hand, a thin light beam catching two holes about to line up.]

This is why the teams who are good at this spend their hours on stuff that looks unglamorous. They run game days and break things in production on purpose, the way Netflix set Chaos Monkey loose to kill its own servers at random, just to prove the system could take it. They write blameless postmortems, because the day people get punished for an outage is the day they stop telling you where the holes are. They treat reliability as a budget they spend, not a feeling they chase.

Most of that is a people problem wearing a technology costume. The highest-leverage slice in your whole stack is usually the on-call engineer at 3am, and whether your culture lets them say "yeah, I broke it, here's how" without flinching. No cloud provider sells that one.

I've built billing and payments platforms where downtime means somebody doesn't get paid, and shipped to production many times a day. The urge to bolt on another layer never goes away. It just gets more expensive, and more comforting.

By all means keep your redundancy. Keep the regions and the replicas. Just don't kid yourself that a standby you've never failed over to is a second slice of cheese. It's a hole with a comforting label, and you'll find out which one it really is on the worst possible morning.