
Feature Flags don’t shift your bugs right. Here’s what actually does.

At a recent webinar led by Alex Casalboni, a sharp statement landed in the chat:

“Feature flags are great but they actually shift bugs right and also tend to not be as tightly controlled as an automated release process as the flag changes are typically done with config rather than code. This gets worse as team sizes grow and you have more and more teams flipping flags on and off.”

This is a great gift. It’s also completely wrong – but in an interesting, productive way that’s worth taking apart, because the assumptions behind it reveal exactly why FeatureOps exists.

Let’s be honest about what’s being said here. Strip the diplomatic phrasing and the argument is: feature flags let you skip quality work, ship broken code, and hide behind a toggle.

The shift right language is a pun, a deliberate inversion of the industry’s beloved shift left – testing earlier, catching bugs sooner, building quality in. The implication is that feature flags are shift left’s evil twin: shipping first, asking questions later.

Here’s the problem with that framing: it confuses the escape route with the reason you need one.

The “shift right” accusation assumes a world that doesn’t exist

Let me start with a thesis: bugs don’t shift right because you added a feature flag. Bugs were always right. (I mean, they’re wrong, but they were on the right side of the spectrum.) Production has always been where the real surprises live. The only question is whether you have a mechanism to respond in seconds, or whether you’re locked into a full rollback cycle that takes minutes to hours while your users absorb the impact.

The shift-left model presupposes that you can catch every defect before production; that your staging environment faithfully reproduces production; that your load tests simulate a real production environment with its intricate concurrency; that your integration tests cover every third-party dependency’s failure mode; that your test accounts behave like real tenants with real data at real scale.

They don’t. And they never have. Because they are approximations. And the more distributed, more interconnected, more AI-accelerated your systems become, the wider that gap grows.

In 2025 alone, Google Cloud and Cloudflare – two organizations with engineering maturity most teams will never reach – suffered major outages from routine changes shipped without runtime controls. Both postmortems pointed to the same gap: missing feature flags or kill switches. Google Cloud’s own analysis noted that if the change had been wrapped in a feature flag, the issue would have been caught before reaching global impact. Cloudflare committed to enabling more global kill switches.

These weren’t teams that skipped their tests. These weren’t teams that winged their release process. These were teams with some of the most sophisticated CI/CD pipelines on the planet, and production still surprised them.

The real question isn’t “did you test enough?”

The real question is “how fast can you undo?”

Let’s run the thought experiment. Two teams ship the same bug:

Team A has no feature flags. They have an excellent CI/CD pipeline, comprehensive test coverage, and a staging environment that’s reasonably close to production. The bug slips through anyway – it’s a concurrency issue that only manifests under production load patterns. They detect it via monitoring. Now they have two choices: roll back the entire deployment (minutes to hours, depending on pipeline complexity and downstream dependencies), or push a hotfix (another full CI/CD cycle). During that window, every user is affected.

Team B has the same pipeline, the same tests, the same staging environment – plus a feature flag wrapping the new code path. The same bug slips through. They detect it via the same monitoring. They flip the flag in seconds. The bug is contained. No rollback, no emergency deployment, no all-hands incident bridge at 2 AM. They now have time to diagnose properly, fix properly, and re-release properly – without the pressure of a production fire burning while they work.
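
To make Team B’s setup concrete, here is a minimal sketch of what “a feature flag wrapping the new code path” can look like in application code. It assumes the Unleash Node SDK (unleash-client); the flag name, types, and pricing-engine functions are illustrative stand-ins, not anything from the original discussion.

```typescript
import { initialize } from 'unleash-client';

type Order = { id: string; items: string[] };
type Receipt = { orderId: string; total: number };

// Stand-ins for the proven and the new implementations (illustrative).
declare function legacyPricingEngine(order: Order): Promise<Receipt>;
declare function newPricingEngine(order: Order): Promise<Receipt>;

// Connect once at startup; the SDK keeps flag state cached locally,
// so the isEnabled() check below is an in-memory lookup, not a network call.
const unleash = initialize({
  url: 'https://unleash.example.com/api/',
  appName: 'checkout-service',
  customHeaders: { Authorization: process.env.UNLEASH_API_TOKEN ?? '' },
});

async function processOrder(order: Order): Promise<Receipt> {
  // The new path ships dark behind the flag. If production misbehaves,
  // turning the flag off routes all traffic back to the proven path in
  // seconds: no rollback, no redeploy.
  if (unleash.isEnabled('new-pricing-engine')) {
    return newPricingEngine(order);
  }
  return legacyPricingEngine(order);
}
```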

Which team shifted bugs right? Neither. Both had the same bug. Both shipped it. The difference is that Team B had reversibility as a first-class property of their release, and Team A didn’t. The critique gets the causality backwards: feature flags don’t cause bugs to reach production. They give you a sub-second response when bugs inevitably do.

Let’s reframe the original critique

Let me restate the sharper version of the original critique, because it’s worth naming directly: “feature flags create a culture where teams skip proper testing because they know they can always flip a switch.”

That’s a real risk – and it’s not a flag problem, it’s a discipline problem. Any powerful tool that lowers the cost of recovery can, in the wrong environment, lower the perceived cost of failure. The same accusation gets thrown at CI/CD (“teams deploy untested artifacts because they can hotfix in ten minutes”), at cloud infrastructure (“teams over-provision because scaling is easy”), at microservices (“teams ship half-baked services because they’re isolated”).

In each case, the answer isn’t to abandon the capability – it’s to pair the capability with the practices that make it work at scale. That’s what FeatureOps is: feature flags plus the discipline around them.

Runtime control vs. configuration updates

The second part of the original question deserves its own answer:

“Flag changes are typically done with config rather than code. This gets worse as team sizes grow.”

This is a real concern. It’s also a solved problem. Most in-house feature flag systems are just a database table behind a REST endpoint plus a lot of undocumented behaviour, resulting in a brittle system that makes the overall application less resilient to failures. By contrast, an enterprise-ready feature flag system with proper change management, lifecycle management, and built-in auditability improves your resilience.
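
For contrast, here is a hypothetical sketch of the kind of homegrown check described above: a remote call in the request hot path, with no cache, no fallback, and no audit trail. The endpoint and response shape are made up; the point is the failure mode.

```typescript
// Hypothetical homegrown flag check: a network call on every evaluation.
// If the flag service is slow or down, every request that evaluates a flag
// inherits that latency or failure, so the flag system reduces resilience
// instead of adding it.
async function isEnabledNaive(flagName: string): Promise<boolean> {
  // No timeout, no local cache, no fallback value, no audit trail.
  const res = await fetch(`https://flags.internal.example.com/flags/${flagName}`);
  const body = (await res.json()) as { enabled?: boolean };
  return body.enabled === true;
}
```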

In a follow-up message – after admitting the original question was partly a troll, meant as a pressure test – the questioner revealed something telling:

“I work at a place now that has a homegrown feature flag system that has run amok. There is no auditing of flag enablement or config changes. Flags go in and are then used forever as quick and dirty ways to update business logic… it’s horrible and dangerous… but it’s also a victim of its own success as it removes friction from the development teams by providing a back door.”

And there it is. The pain behind the question isn’t about feature flags. It’s about feature flags without FeatureOps.

A database row that anyone can flip without an audit trail isn’t a feature flag system – it’s a liability with an API. The problems described – no auditing, no lifecycle management, flags used as permanent business logic backdoors, no governance – are symptoms of a homegrown system that stopped evolving the moment it started working: database table + REST API + no controls. It has the runtime dependency overhead of a real flag platform without any of the capabilities that justify that cost. There’s a maturity spectrum between “remote config with an if-statement” and a genuine runtime control platform – and that gap deserves its own deep dive.

A mature FeatureOps practice addresses every single concern raised:

“No auditing of flag changes”

Every state change has a full audit trail. Who changed what, when, and why. When your first question during an incident is “what changed?”, the answer is immediate and complete.
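
As an illustration of what “who changed what, when, and why” needs to capture, here is a hypothetical shape for a single audit event. The field names are ours, not any specific product’s schema.

```typescript
// Hypothetical audit record for one flag state change.
interface FlagAuditEvent {
  flag: string;                 // which flag changed
  environment: string;          // where it changed, e.g. "production"
  actor: string;                // who changed it (user or API token)
  action: 'enabled' | 'disabled' | 'strategy-updated' | 'archived';
  previousState: unknown;       // what it looked like before
  newState: unknown;            // what it looks like now
  reason?: string;              // why, e.g. a linked change request
  timestamp: string;            // when, as an ISO-8601 instant
}
```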

“Flags go in and are used forever”

Lifecycle management with expiration tracking, health dashboards that surface stale flags, and automated cleanup policies that prevent flag sprawl from compounding into technical debt.
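
A sketch of how expiration tracking might work in practice; the metadata fields and the staleness rule are illustrative assumptions, not a particular product’s behavior.

```typescript
// Hypothetical flag metadata used for lifecycle tracking.
interface FlagLifecycle {
  name: string;
  type: 'release' | 'experiment' | 'kill-switch' | 'permission';
  createdAt: Date;
  expectedRemovalDate?: Date;   // release and experiment flags should have one
}

// Surface flags that have outlived their expected lifetime so they show up
// on a health dashboard instead of quietly becoming permanent business logic.
function staleFlags(flags: FlagLifecycle[], now: Date = new Date()): FlagLifecycle[] {
  return flags.filter(
    (f) =>
      f.type !== 'kill-switch' &&                 // kill switches are meant to stay
      f.expectedRemovalDate !== undefined &&
      f.expectedRemovalDate.getTime() < now.getTime(),
  );
}
```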

“Quick and dirty ways to update business logic”

Change request workflows with configurable approval gates. The four-eyes principle. Role-based access control by project, environment, and flag. The friction is intentional where it needs to be, and invisible where it doesn’t.
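
To show where that intentional friction sits, here is a hypothetical per-environment policy definition. The structure is illustrative, not a specific product’s configuration format.

```typescript
// Hypothetical per-environment change policy: friction where it matters,
// none where it doesn't.
interface EnvironmentPolicy {
  environment: string;
  requireChangeRequest: boolean;   // flag changes go through an approval flow
  requiredApprovals: number;       // e.g. 2 for the four-eyes principle
  rolesAllowedToApprove: string[]; // RBAC scoped by project and environment
}

const policies: EnvironmentPolicy[] = [
  { environment: 'development', requireChangeRequest: false, requiredApprovals: 0, rolesAllowedToApprove: [] },
  { environment: 'production',  requireChangeRequest: true,  requiredApprovals: 2, rolesAllowedToApprove: ['release-manager', 'sre'] },
];
```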

“Not as tightly controlled as code”

This is the key misconception. Runtime control doesn’t mean uncontrolled. It means controlled at a different level of the stack – with governance that matches the speed and scope of the decisions being made.

The shift-left/shift-right framing is a false binary

Here’s the irony: you can make the exact same loaded accusation about shift left. Shift left causes over-engineering. It complicates the development process; it creates test suites that take 45 minutes to run and still catch only 60% of real-world issues; it gives teams false confidence that production is safe because staging was green.

Both framings are reductive. The reality is that quality is a spectrum of practices across the entire lifecycle, not a single direction on a number line.

Alex offered an elegant analogy for this: construction sites. Workers wear protective equipment (hard hats, belts, safety nets, rails, and so on) and are trained and examined on Health & Safety rules – all to make sure they work in a safe environment. Yet when an emergency occurs anyway, they need to provide first aid, reach medical services, and call an ambulance as fast as possible, because reaction time matters. On a construction site, nobody argues that one of these is right and the other wrong. Both ends of that spectrum are critical.

Protective equipment and H&S rules are exactly like shift-left. Having them in place is priceless: catch what you can early. Static analysis, unit tests, integration tests, contract tests, security scanning – absolutely, do all of it. But shift-left alone is incomplete, because production is fundamentally different from any test environment.

FeatureOps doesn’t replace shift left.

FeatureOps completes the picture. It says: yes, test thoroughly before production. And also have the runtime controls to respond instantly when production teaches you something your tests couldn’t. It’s the first-aid and ER side of the analogy – fast, reactive control. Progressive rollouts that expose 1% before 100%. Kill switches that respond in seconds. Automated rollback, tied to SLOs, that doesn’t require a human to notice a dashboard.
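
As a sketch of what “automated rollback tied to SLOs” can look like: a small watchdog that compares a live error rate against a threshold and disables the flag through the flag platform’s admin API when the budget is blown. The helper functions, threshold, and polling interval are assumptions for illustration.

```typescript
// Hypothetical SLO watchdog: containment does not wait for a human to
// notice a dashboard.
interface SloCheck {
  flag: string;
  errorRateThreshold: number; // e.g. 0.02 = 2% of requests failing
}

// Assumed helpers: read the live error rate for traffic on the flagged
// code path, and disable a flag through the flag platform's admin API.
declare function currentErrorRate(flag: string): Promise<number>;
declare function disableFlag(flag: string, reason: string): Promise<void>;

async function enforceSlo(check: SloCheck): Promise<void> {
  const errorRate = await currentErrorRate(check.flag);
  if (errorRate > check.errorRateThreshold) {
    // The kill switch fires in seconds; diagnosis happens afterwards, calmly.
    await disableFlag(check.flag, `error rate ${errorRate} exceeded SLO threshold`);
  }
}

// Poll on a short interval so containment time is bounded by that interval.
setInterval(() => {
  enforceSlo({ flag: 'new-pricing-engine', errorRateThreshold: 0.02 }).catch(console.error);
}, 30_000);
```

The trigger could just as well be an alerting webhook instead of a poll; the important property is that containment doesn’t depend on a person being awake.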

This isn’t an either/or. It’s defense in depth.

The coordination concern is real – and it’s not caused by flags

“This gets worse as team sizes grow and you have more and more teams flipping flags on and off. This can be managed with orchestration/coordination among teams but this just slows the feature release cycle time.”

Fair point. But here’s the thing: the coordination problem exists regardless of whether you use feature flags. Without flags, you’re coordinating merge order, release trains, deployment windows, and rollback sequences. With flags, you’re coordinating flag states and rollout schedules.

The difference is that flag-based coordination is more granular, more reversible, and more observable than deployment-based coordination. You can release Feature A to 10% while Feature B is at 100% while Feature C is internal-only. Try doing that with deployment coordination alone.
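
Pictured as flag state rather than deployment order, that coordination might look something like this; the flag names and rollout shapes are illustrative.

```typescript
// Hypothetical snapshot of three independent rollout states, all shipped in
// the same deployment but coordinated at runtime rather than via merge order
// or release trains.
type RolloutState =
  | { kind: 'percentage'; percent: number }   // Feature A: 10% of users
  | { kind: 'everyone' }                      // Feature B: fully rolled out
  | { kind: 'segment'; segment: string };     // Feature C: internal users only

const releasePlan: Record<string, RolloutState> = {
  'feature-a-new-search': { kind: 'percentage', percent: 10 },
  'feature-b-dark-mode': { kind: 'everyone' },
  'feature-c-billing-v2': { kind: 'segment', segment: 'internal-employees' },
};
```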

And yes, this requires tooling. Release templates that standardize multi-stage rollout blueprints. Segments that keep targeting rules consistent. Lifecycle governance that tracks ownership and cleanup dates. This is exactly the operational infrastructure that distinguishes a flag system from a FeatureOps platform.

The real danger is the lack of discipline and governance

A homegrown system with no audit trail, no lifecycle management, no approval workflows, and no integration with your observability stack is a ticking time bomb that happens to be a victim of its own success.

But the conclusion shouldn’t be that “feature flags shift bugs right.” The conclusion should be: a toggle without governance is a liability, and a toggle with governance is a safety net.

That’s the gap FeatureOps closes. Not by arguing that you should skip quality, but by acknowledging that quality under time and resource pressure has limits, that production has surprises, and that the speed at which you can respond to those surprises is the difference between a blip and an outage.

DevOps optimized the path to production. FeatureOps extends that discipline into what happens after code is running. The goal isn’t speed for its own sake. The goal is connecting engineering effort to user outcomes – releasing with confidence, learning from every deployment, recovering from problems in seconds instead of hours.

Feature flags don’t shift bugs right. They give you a sub-second left turn when a bug shows up where it was always going to end up anyway.

Where exactly do runtime decisions belong relative to build-time and deploy-time configuration? How do you tell when a decision is misplaced at the wrong level of your stack? We’re working on a deep-dive series that maps this landscape for infrastructure engineers. Stay tuned.
