Lessons we keep learning from the industry’s biggest outages
The last six months have delivered two of the most instructive outages in recent memory. And while these incidents stood out because of who they affected and the magnitude of the disruption, they fit into a long-running pattern. Over the past decade, we’ve seen similarly “impossible” failures on countless SaaS platforms, almost always triggered by a small, “safe” backend or configuration change. Each time, the public postmortem includes some variant of the same line: we didn’t expect this change to be dangerous.
When a “small backend change” takes down the internet
On June 12, Google Cloud experienced a global disruption that rippled across Gmail, BigQuery, Cloud Run, Google Meet, and other core services. The root cause was surprisingly small: a backend policy change in Google’s Service Control system that triggered a null pointer condition. The change was deployed without a feature flag, and Google’s own postmortem made the point clearly:
“The issue with this change was that it did not have appropriate error handling nor was it feature flag protected… if this had been flag protected, it would’ve been caught in staging.”
Part of what makes backend changes so deceptively hazardous is the way control-plane systems amplify them. A tiny policy adjustment or configuration update (something a single engineer might expect to impact one subsystem) can quietly propagate through dozens of services across regions. Humans are extremely bad at intuitively understanding these propagation paths, especially in systems with deep dependency graphs, distributed caches, and asynchronous replication. A change that “shouldn’t matter” can, in practice, become global within minutes.
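To make the missing safeguard concrete, here is a minimal Go sketch of what “flag protected with appropriate error handling” looks like for a new backend policy path. This is not Google’s code; the flag name, the Policy type, and the quota logic are all hypothetical.

```go
// Hypothetical sketch only: the flag name, Policy shape, and quota logic are invented.
package main

import (
	"errors"
	"fmt"
)

// flagEnabled stands in for a runtime flag lookup (a flag service or config store).
// In production it would be evaluated per request and flippable without a deploy.
func flagEnabled(name string) bool {
	return false // new path stays off by default until proven in staging and canary
}

// Policy models a control-plane object where a newer field may legitimately be absent.
type Policy struct {
	Quota *int // nil for policies written before the field existed
}

// evaluateQuotaNew is the new code path; it must tolerate missing data instead of crashing.
func evaluateQuotaNew(p *Policy) (int, error) {
	if p == nil || p.Quota == nil {
		return 0, errors.New("policy missing quota field")
	}
	return *p.Quota, nil
}

// evaluateQuota wraps the new path in a flag and falls back to legacy behavior on error.
func evaluateQuota(p *Policy) int {
	const legacyDefault = 100
	if !flagEnabled("policy.quota_check.v2") {
		return legacyDefault
	}
	q, err := evaluateQuotaNew(p)
	if err != nil {
		return legacyDefault // degrade to the old behavior rather than failing globally
	}
	return q
}

func main() {
	// A policy with a missing field no longer takes the service down.
	fmt.Println(evaluateQuota(&Policy{}))
}
```

The shape is what matters, not the details: the new path is off by default, and even once it is enabled, malformed input degrades to the old behavior instead of crashing the binary.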
Fast forward to November 18, ten days before Black Friday. Cloudflare, the traffic cop of the internet, suffered a widespread incident that took large portions of its network and customer traffic offline. A single backend configuration update, routine in every respect, propagated globally and resulted in five and a half hours of downtime. In its writeup, Cloudflare committed to addressing the gap by:
“Enabling more global kill switches for features.”
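What a “global kill switch” amounts to in code is unglamorous: one runtime value, flippable without a deploy, that routes everything back to the last known-good path. A hedged Go sketch, with all names invented and no claim to resemble Cloudflare’s implementation:

```go
// Hypothetical sketch only: not Cloudflare's implementation.
package main

import (
	"fmt"
	"sync/atomic"
)

// killSwitch is flipped by an operator through a control plane (config push or
// API call), never by redeploying the service.
var killSwitch atomic.Bool

// handleRequest consults the switch on every request, so disabling the new path
// takes effect immediately and everywhere.
func handleRequest(req string) string {
	if killSwitch.Load() {
		return legacyPath(req)
	}
	return newPath(req)
}

func newPath(req string) string    { return "new:" + req }
func legacyPath(req string) string { return "legacy:" + req }

func main() {
	fmt.Println(handleRequest("a")) // new behavior
	killSwitch.Store(true)          // incident response: flip the switch
	fmt.Println(handleRequest("b")) // traffic instantly back on the known-good path
}
```

A few lines of code, but only if the legacy path still exists and the switch is wired in before the incident, not during it.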
Why feature flags are necessary beyond front-end changes
Two different companies, two different systems, two different triggers. But underneath it all, the same story: a small, routine backend change shipped without runtime control. A change so unremarkable that no one expected it to be risky. That is the nature of unscheduled outages. They emerge from the changes everyone assumes are safe.
And yet, even with years of postmortems telling us the same thing, engineering teams continue to repeat the pattern. Some of this is cultural: once a workflow becomes routine, teams stop treating it with skepticism. Confidence replaces caution. Some of it is structural: organizations optimize for deployment velocity, not for reversibility. And some of it is simply the reality of modern infrastructure. Staging environments can no longer replicate production scale, traffic diversity, or data shape. We rely on them anyway, even though they routinely fail to surface the exact category of risk that later brings down production.
This is the part our industry still struggles to internalize. Feature flags are not just UI toggles. They are not cosmetic switches. They are runtime controls for application behavior. UIs, yes, but also backend logic, permission paths, policy enforcement, configuration updates, and every other invisible mechanism where real outages originate. When these controls are missing, even the most mature engineering organizations can be brought to a standstill by a single line of configuration or a simple policy update.
And the places where flags matter most are rarely the obvious ones. Authentication and authorization flows. Schema migrations. Data routing and partitioning logic. ML model activation or model-version selection. Pricing and billing rules. Traffic steering between third-party providers. Migration flags for infrastructure upgrades or control-plane behavior. These are not “features” in the conventional sense, but they are some of the highest-impact surfaces in any large-scale system. Without runtime control, they become binary, high-blast-radius deployment landmines.
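None of these surfaces needs exotic tooling to come under runtime control. As a hypothetical illustration in Go (the provider names, the Charge interface, and the stubbed flag lookup are all invented), steering between third-party providers behind a flag can be as small as this:

```go
// Hypothetical sketch only: provider names, the Charge interface, and the flag
// lookup are invented for illustration.
package main

import "fmt"

// providerFlag stands in for a runtime lookup from a flag service or config store.
func providerFlag() string { return "provider-a" }

// Provider abstracts a third-party billing dependency.
type Provider interface {
	Charge(amountCents int) error
}

type providerA struct{}
type providerB struct{}

func (providerA) Charge(int) error { return nil }
func (providerB) Charge(int) error { return nil }

// activeProvider steers traffic based on the flag; unknown values fall back to
// the incumbent, so a bad config value cannot strand requests.
func activeProvider() Provider {
	switch providerFlag() {
	case "provider-b":
		return providerB{}
	default:
		return providerA{}
	}
}

func main() {
	if err := activeProvider().Charge(500); err != nil {
		fmt.Println("charge failed:", err)
		return
	}
	fmt.Println("charged via the flag-selected provider")
}
```

The same pattern covers model-version selection, data routing, or a migration toggle: a switch on a runtime value and a safe default, which turns a binary deployment event into a reversible decision.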
When the solution really is as simple as it seems
Some aspects of running large-scale systems are genuinely complex. But these safeguards, like feature flags, are not. Gradual rollouts are straightforward. Kill switches are straightforward. Wrapping backend changes in runtime control is straightforward. These practices are recommended in the Google Site Reliability Engineering handbook and the AWS Well-Architected Framework, by thought leaders such as Martin Fowler and Thoughtworks, and by DORA, which considers them a near-ubiquitous practice within high-performing organizations. Yet week after week, risky changes continue to land directly in production without them.
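Gradual rollout, for instance, is a few lines: hash a stable identifier and compare it against a percentage an operator can raise in steps. A hedged Go sketch, with the percentage source and customer IDs being hypothetical:

```go
// Hypothetical sketch only: the rollout percentage and customer IDs are invented.
package main

import (
	"fmt"
	"hash/fnv"
)

// rolloutPercent would come from a flag service and be raised in steps
// (1% -> 5% -> 25% -> 100%) while dashboards are watched between steps.
const rolloutPercent = 5

// inRollout hashes a stable identifier so the same customer gets the same answer
// on every request, keeping behavior consistent while the percentage climbs.
func inRollout(customerID string) bool {
	h := fnv.New32a()
	h.Write([]byte(customerID))
	return h.Sum32()%100 < rolloutPercent
}

func main() {
	for _, id := range []string{"acme", "globex", "initech"} {
		fmt.Printf("%s -> new path: %v\n", id, inRollout(id))
	}
}
```

That is the entire mechanism. The operational discipline around it, watching metrics between steps and keeping the old path alive until the rollout reaches 100%, is where the real work lies.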
The pattern is clear, and we’ve learned it enough times. If Google can be taken down by a missing feature flag, and Cloudflare can be taken down by a missing kill switch, anyone can. The only question that remains is how many more reminders the industry needs before treating runtime controls as mandatory rather than optional.
Operational excellence at scale is about containing risk. Treating every backend change as reversible. Ensuring every high-impact path can be shut off instantly. Designing for progressive, observable rollout rather than binary deployment events. Feature flags aren’t a nice-to-have in this model. They are the model.
Black Friday is almost here. The internet really cannot afford to take another long lunch break.