How to measure software delivery performance

Melinda Fekete

Documentation Lead

May 15, 2026

Traditional release counts are vanity numbers. Knowing a team shipped 50 times a month means nothing without knowing how many of those deployments broke production or how long the team took to recover. An honest read of engineering performance needs a framework that measures throughput and stability together. The four core DORA metrics, plus reliability, give you that baseline. Moving them, though, takes more than better dashboards. It takes a change to your underlying delivery architecture. Decoupling deployment from release turns feature flags into a measurement substrate that makes every DORA metric easier to move.

TL;DR

Release frequency on its own is a vanity metric. Throughput only means something when you measure it alongside stability.
Speed and stability reinforce each other. Adopting AI without runtime safety controls can drag delivery stability down by 7.2 percent.
Over-engineering data extraction pipelines from your source control rarely pays off. Get a simple baseline first.
Decoupling deployment from release makes the metrics movable: you can verify in production and roll back without redeploying.
Moving to a FeatureOps model lets teams scale to 100+ daily releases by tying deployment to runtime control rather than release windows.

Defining the four DORA metrics and reliability

Real performance measurement replaces raw activity counts with metrics that show whether your system creates value or just noise. You need to track how quickly ideas become code alongside how that code behaves once it is live. The original DORA framework defines four indicators, split between throughput and stability.

Throughput: deployment frequency and lead time for changes

Throughput starts with deployment frequency and lead time for changes. High-performing teams compress lead time from successful commit to live deployment to less than a single day, allowing them to deploy multiple times on demand. Volume alone means nothing, but improving these two variables forces you to work in small, testable batches that flow smoothly through your pipeline.
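
As a rough illustration of the two throughput numbers, the sketch below computes deployment frequency and median lead time from a handful of commit and deploy timestamps. The records, the five-day window, and the function names are assumptions made for this example; they do not come from any particular tool.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (commit time, production deploy time) for each change.
deployments = [
    (datetime(2026, 5, 4, 9, 0), datetime(2026, 5, 4, 13, 30)),
    (datetime(2026, 5, 4, 15, 0), datetime(2026, 5, 5, 10, 0)),
    (datetime(2026, 5, 6, 11, 0), datetime(2026, 5, 6, 12, 0)),
    (datetime(2026, 5, 7, 8, 0), datetime(2026, 5, 8, 8, 0)),
]


def deployment_frequency(records, window_days):
    """Average number of production deployments per day over the window."""
    return len(records) / window_days


def median_lead_time(records):
    """Median time from commit to production deployment."""
    return median(deployed - committed for committed, deployed in records)


per_day = deployment_frequency(deployments, window_days=5)
lead_time = median_lead_time(deployments)
```

The median is used here rather than the mean so that one slow change does not dominate the figure. That is a design choice for the sketch, not a requirement of the framework.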

Stability: change failure rate and time to restore service

Speed becomes a liability without stability measured alongside it. The change failure rate is the percentage of deployments that cause incidents requiring remediation, and time to restore service tracks how long recovery takes after such an incident. Top development teams spot production issues quickly enough to restore normal service in under an hour, with a failure rate that sits steadily around 5 to 15 percent.
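
A minimal sketch of the two stability numbers, assuming each deployment record already carries its recovery time (or None when nothing broke). The identifiers and durations are invented for illustration.

```python
from datetime import timedelta

# Hypothetical data: recovery time for the deployments that caused an incident.
restore_by_deploy = {
    "d03": timedelta(minutes=50),
    "d14": timedelta(minutes=30),
}

# Twenty deployments; the second field is None when no incident followed.
deployments = [
    (f"d{i:02d}", restore_by_deploy.get(f"d{i:02d}")) for i in range(1, 21)
]


def change_failure_rate(records):
    """Share of deployments that led to an incident requiring remediation."""
    if not records:
        return 0.0
    failed = sum(1 for _, restore in records if restore is not None)
    return failed / len(records)


def mean_time_to_restore(records):
    """Average recovery time across failed deployments, or None if none failed."""
    durations = [restore for _, restore in records if restore is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

With this sample data, two failures in twenty deployments give a 10 percent change failure rate and a 40-minute mean time to restore.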

Reliability as the fifth metric

Google updated the framework in 2021 by adding reliability as the fifth operational metric. The expansion makes sure teams evaluate the operational health of the running software, not just the mechanics of deployment. Grading only the deployment steps obscures the real user experience. Running available, efficient applications matters as much as shipping the initial code quickly.

How the industry clusters performance tiers

In 2022, the State of DevOps Report formally dropped the Elite performer category, clustering teams into High, Medium, and Low tiers. The redefinition pushes organizations toward broad systemic improvement rather than chasing a status badge.

DORA research consistently shows that speed and stability reinforce one another, and top performers do well on throughput and stability at the same time. Rushing poor code does not save time. Faulty updates break the environment and trigger pipeline rollbacks, and the recovery work drags overall deployment frequency back down.

Technology adoption shifts the cluster picture in less obvious ways. AI tools improve individual productivity, with measured gains of 7.5 percent in documentation quality and 3.4 percent in code quality. The same data shows that AI adoption can reduce delivery stability by 7.2 percent. Developers write code faster and abandon small batches for large pull requests, and the build breaks more often.

Apply DORA metrics to entire applications over time, and evaluate the application workflow rather than individual outcomes, which distort the picture. Using these numbers to manage the daily performance of single engineers damages team dynamics and pushes people to game the statistics at the expense of overall product health.

Overcoming the instrumentation trap

Standardizing on baseline metrics solves the alignment problem, but extracting the numbers introduces a real engineering hurdle. The deployment, change, and incident records you need live scattered across systems. Pull requests sit in source control. Deployments flow through CI/CD tooling. Incidents land in ticketing platforms.

A common reaction is to build a heavy data extraction pipeline that stitches the systems together into a unified dashboard. DORA guidance flags over-engineering this data pipeline as a major anti-pattern. Building complex integrations only observes your bottlenecks. Resolving the underlying delivery workflow yields better results.

Hardware limits rarely explain poor metrics either. Roughly 90 percent of CPUs in continuous integration tools sit unused, yet teams still suffer queuing delays. Throwing compute at the pipeline does not fix process inefficiencies. You spend money to wait in the same long line.

Calculating a baseline change failure rate, for example, requires capturing the total number of attempted deployments and mapping each production incident back to the specific deployment that caused it through a unique marker, such as a deployment ID. Automating that mapping across disparate tools needs ongoing maintenance. Treating feature flags as the observable measurement substrate sidesteps the extraction problem entirely. The system that controls feature visibility becomes your primary performance record.
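
The mapping step described above can be sketched as a simple join. This example assumes each incident ticket carries the ID of the deployment it was traced to. The field names, ticket numbers, and release IDs are hypothetical.

```python
# Hypothetical export from CI/CD tooling: every attempted production deployment.
deployment_ids = ["rel-101", "rel-102", "rel-103", "rel-104", "rel-105"]

# Hypothetical export from a ticketing platform.
incidents = [
    {"ticket": "INC-1", "deploy_id": "rel-102"},
    {"ticket": "INC-2", "deploy_id": "rel-102"},  # second ticket, same deployment
    {"ticket": "INC-3", "deploy_id": "rel-104"},
    {"ticket": "INC-4", "deploy_id": None},  # nobody recorded a marker
]


def baseline_change_failure_rate(deploy_ids, incident_records):
    """Return (failure rate, tickets that could not be mapped to a deployment)."""
    known = set(deploy_ids)
    # A deployment counts as failed once, however many tickets point at it.
    failed = {i["deploy_id"] for i in incident_records if i["deploy_id"] in known}
    unmatched = [i["ticket"] for i in incident_records if i["deploy_id"] not in known]
    rate = len(failed) / len(known) if known else 0.0
    return rate, unmatched


rate, unmatched = baseline_change_failure_rate(deployment_ids, incidents)
```

The unmatched list is the part that needs ongoing maintenance: every ticket without a usable marker has to be traced by hand or dropped from the baseline.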

Where feature flags directly move each metric

Reporting pipelines record the history of a failure. To improve your delivery scores, you have to change the architecture itself. Bringing feature flags into the equation pulls operational risk out of the continuous delivery pipeline. When you stop relying on slow deployment phases to protect users, every metric becomes easier to move.

Accelerating throughput

Feature flags reduce lead time by letting developers verify production changes minutes after the code is written. The code exists in production but stays hidden from end users. You can then run final verifications on live infrastructure without scheduling massive release windows or fearing instant customer impact.

Consider a mid-stage technology team trying to ship a new checkout service. Under a traditional model, they merge code on Tuesday and wait for the Thursday night maintenance window. Marketing, engineering, and support sit on a 2 a.m. call hoping the database migration holds. With feature flags, that same team can merge on Tuesday morning, deploy silently to production five minutes later, toggle the flag for QA on Tuesday afternoon, and turn it on for internal staff on Wednesday. Throughput climbs because the deployment itself carries no customer risk.
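
The staged exposure in that scenario can be sketched with a tiny in-memory flag. This is an illustrative stand-in, not the Unleash SDK: the Flag class, the group names, and the checkout function are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Flag:
    """Hypothetical flag that is visible only to the listed user groups."""

    name: str
    enabled_groups: set = field(default_factory=set)

    def is_enabled(self, user_groups):
        return bool(self.enabled_groups & set(user_groups))


def checkout(user_groups, flag):
    """Route to the new code path only when the flag is on for this user."""
    if flag.is_enabled(user_groups):
        return "new-checkout"
    return "legacy-checkout"


# Tuesday morning: the code is deployed, but no group can see it yet.
flag = Flag("new-checkout-service")
tuesday_morning = checkout(["customer"], flag)

# Tuesday afternoon: QA is added. Wednesday: internal staff are added.
flag.enabled_groups.add("qa")
flag.enabled_groups.add("staff")
wednesday_staff = checkout(["staff"], flag)
wednesday_customer = checkout(["customer"], flag)
```

Nothing is redeployed between the stages. Only the flag state changes, which is why the deployment itself carries no customer risk.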

Mercadona Tech proved this model at scale. Moving away from monolithic releases helped the organization release to production over 100 times a day with a 13-hour average deployment cycle time using a FeatureOps approach. Talentech ran a similar transformation, moving from a rigid quarterly release motion to shipping new features safely every week.

Reducing change failure rate with progressive rollouts

A change that ships to 1 percent of users has a 1 percent blast radius. That is the core mechanism behind progressive rollouts. Wrapping a change in a feature flag and exposing it gradually (internal users, then a beta cohort, then 10 percent, then 50, then 100) turns “the deploy went bad” from a global incident into a contained signal you can read before most users see anything.

Targeting by geography, customer type, or any attribute you define narrows the blast radius further. Over time, change failure rate drops because most “failed changes” never reach the wider user base.
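
One common way to implement a percentage rollout is to hash each user into a stable bucket and compare the bucket with the current percentage. The sketch below assumes SHA-256 as the hash purely for illustration; it is not presented as the scheme any specific flag platform uses, and the function names are hypothetical.

```python
import hashlib


def bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket from 1 to 100 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100 + 1


def in_rollout(user_id: str, flag_name: str, percentage: int) -> bool:
    """True when the user's bucket falls inside the current rollout percentage."""
    return bucket(user_id, flag_name) <= percentage


users = [f"user-{n}" for n in range(10_000)]
exposed_at_10 = {u for u in users if in_rollout(u, "new-checkout", 10)}
exposed_at_50 = {u for u in users if in_rollout(u, "new-checkout", 50)}
```

Because the bucket depends only on the user and the flag name, a user who sees the change at 10 percent keeps seeing it at 50 percent, so widening the rollout never flips anyone back to the old path.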

Controlling instability with runtime kill switches

Pipeline rollbacks are slow. Reverting a bad commit, waiting for CI to pick it up, running the test suite, and redeploying the older image takes anywhere from 10 minutes to an hour, while customers see error pages.

Runtime kill switches change the math. If an exception spikes after a new service deploys, returning the environment to a stable state means flipping a boolean. The new code path stops running. With sub-second propagation through Unleash Enterprise Edge streaming or on the next SDK refresh, recovery shrinks from 45 minutes to seconds.
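
A minimal sketch of the kill-switch pattern, assuming an in-memory dictionary stands in for flag state that a real system would receive from its flag service. The flag name and both pricing functions are hypothetical, and the new code path fails on purpose to simulate a regression.

```python
# Hypothetical flag state; in practice this would be refreshed from a flag service.
flags = {"new-pricing-service": True}


def new_pricing(cents: int) -> int:
    """The newly deployed code path, simulating a production regression."""
    raise RuntimeError("regression in new pricing path")


def legacy_pricing(cents: int) -> int:
    """The known-good code path: adds 20 percent using integer arithmetic."""
    return cents + cents * 20 // 100


def price(cents: int) -> int:
    # The flag is read on every call, so flipping it changes behavior
    # without reverting a commit or redeploying an image.
    if flags.get("new-pricing-service", False):
        return new_pricing(cents)
    return legacy_pricing(cents)


# Recovery is a state change, not a pipeline run.
flags["new-pricing-service"] = False
recovered_price = price(1000)
```

The default of False in the lookup means a missing flag falls back to the legacy path, which is the safer failure mode for a kill switch.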

Tracking performance with Unleash dashboards

Toggles shift the primary control point from the codebase to the live environment, which makes platform visibility the last piece of the loop. Unleash connects feature visibility directly to operational outcomes so metric evaluation does not depend on guesswork. Modern release management needs native dashboards that link system health to logical releases. Teams get this through the Unleash dashboard, which measures lead time for changes per feature, and the Insights Dashboard, which surfaces average time to production. Activation feedback comes out of the box, with no need to scrape CI logs.

Automated workflows let teams actively govern change failure rate. Watching static metric boards for failures wastes response time. Practitioners use real-time impact metrics inside the platform. If an automated rollout expands to 25 percent of your user base and server latency climbs, the system reads the telemetry and pauses the release. You command the rollout directly from the data it generates.
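
The pause-on-regression behavior can be sketched as a small guard that runs after each telemetry reading. The stages, the 20 percent latency tolerance, and every name below are assumptions for illustration, not a description of how Unleash implements it.

```python
from dataclasses import dataclass

# Hypothetical rollout stages, as percentages of the user base.
STAGES = [1, 10, 25, 50, 100]


@dataclass
class Rollout:
    percentage: int = STAGES[0]
    paused: bool = False


def evaluate(rollout, p95_latency_ms, baseline_ms, tolerance=1.2):
    """Pause when latency exceeds baseline * tolerance, otherwise advance a stage."""
    if rollout.paused:
        return rollout  # stays paused until someone intervenes
    if p95_latency_ms > baseline_ms * tolerance:
        rollout.paused = True
        return rollout
    stage = STAGES.index(rollout.percentage)
    if stage < len(STAGES) - 1:
        rollout.percentage = STAGES[stage + 1]
    return rollout


rollout = Rollout()
evaluate(rollout, p95_latency_ms=200, baseline_ms=200)  # healthy: 1 -> 10
evaluate(rollout, p95_latency_ms=210, baseline_ms=200)  # healthy: 10 -> 25
evaluate(rollout, p95_latency_ms=300, baseline_ms=200)  # latency climbs: pause
```

After the third reading the rollout is held at 25 percent, so the remaining 75 percent of users never reach the slow code path.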

Shifting control from code shipment to feature activation

Improving your CI pipeline eventually hits a hard ceiling set by the speed of your test suite. Shifting the primary control point from initial code shipment to actual user activation gets you past that bottleneck.

Feature flags act as an intelligent measurement substrate, letting engineering organizations manage business outcomes dynamically. Decoupling the release lifecycle fixes the deployment process itself. Structuring your architecture to govern runtime safely makes excellent delivery metrics the default outcome.

FAQs about software delivery performance

What is the difference between deployment and release?

A deployment pushes new code to a production environment. A release exposes that functional code to end users. Decoupling the two operations lets teams test quietly in live environments without risking customer experience. Feature flags isolate the release decision from the deployment pipeline.

What are the four DORA metrics?

The framework measures two throughput indicators and two stability indicators to balance the picture. Throughput covers deployment frequency and lead time for changes. Stability covers change failure rate and time to restore service. High-performing teams improve all four at the same time.

Why did DORA add a fifth metric?

Google added reliability so that teams measure the operational health of the running software, not just the mechanical speed of the pipeline. Speed means little if the resulting application drops user sessions or misses availability targets.

How do feature flags reduce lead time for changes?

Developers can verify code safely in the live production environment without waiting for technical release windows. Production verification happens minutes after the code is written. The delay between completing a commit and reaching a verified production state shrinks to almost zero.

Can DORA metrics measure individual developer performance?

Applying operational metrics to evaluate or discipline individual engineers creates perverse incentives. The DORA framework evaluates workflow efficiency and application delivery processes over a broad timeline. Using these numbers for individual performance reviews encourages developers to shrink their commits to meaningless sizes to manipulate the statistics.
