Measuring Agentic AI Developer Productivity: A CTO’s Framework

I spent most of June looking at dashboards that told me my engineering org had never been faster. Pull requests were up. Commits were up. Every chart pointed the same direction: more. Then I asked a harder question. How much of that output actually shipped, stayed shipped, and mattered to a customer? The answer humbled me. If you’re a CTO trying to justify or defend an agentic AI budget this quarter, you’re probably staring at the same mismatch I was.

The move from generative to agentic AI changed what our teams produce, but it did not change what we know how to measure. Sprint velocity and lines of code were already weak proxies for value. Agentic AI just made them actively misleading. Here’s what I’ve learned about separating real productivity gains from noise, and the metric stack I’m now using to make budget calls.

The productivity mirage: more code, less of it survives

GitClear’s 2026 analysis of over 211 million lines of code found that regular AI users produced up to 9x the code churn of developers who weren’t using AI tools, an increase more than double the productivity gain the tools appeared to deliver. In AI-heavy projects, the churn rate climbed from 3.1% to 5.7%, and the share of code that got properly refactored (rather than copy-pasted or rewritten from scratch) collapsed from roughly a quarter to under 10%.

Waydev’s data across 50 customers and more than 10,000 engineers, cited by TechCrunch, tells a similar story from a different angle: developers accept 80-90% of AI-generated code the moment it’s suggested, but only 10-30% of it survives the following weeks of revisions. Most of what agentic tools generate gets torn up and rebuilt. If your dashboard only counts the moment of acceptance, you’re measuring a number that quietly disappears a few sprints later.
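To make the gap between acceptance and survival concrete, here is a minimal sketch of the calculation I have in mind. The record shape and field names are my own assumptions for illustration, not the schema of Waydev or any other vendor, and how you count a line as "surviving" depends on your own tooling.

```python
from dataclasses import dataclass


@dataclass
class AcceptedChange:
    """One AI-generated change, as tracked by a hypothetical analytics export."""
    lines_accepted: int   # lines accepted at suggestion time
    lines_surviving: int  # of those, lines still unmodified after the observation window


def survival_rate(changes: list[AcceptedChange]) -> float:
    """Share of accepted AI-generated lines still in place after the window."""
    accepted = sum(c.lines_accepted for c in changes)
    if accepted == 0:
        return 0.0
    surviving = sum(c.lines_surviving for c in changes)
    return surviving / accepted


# Illustrative numbers only, not real data.
sample = [AcceptedChange(100, 20), AcceptedChange(50, 10)]
print(f"survival rate: {survival_rate(sample):.0%}")
```

Reporting this number next to the acceptance rate is the point: the acceptance figure alone says nothing about what is left a few sprints later.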

There’s a budget angle buried in this too. Jellyfish’s Q1 2026 review of 7,548 engineers found that developers with the biggest AI token budgets produced the most pull requests, but they achieved roughly twice the throughput at ten times the token cost. Spend didn’t scale with value. It scaled with volume.
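The arithmetic behind that finding is worth spelling out. Taking the two multiples at face value (roughly 2x the throughput for 10x the token cost), the token cost of each pull request works out to about five times higher. The function name below is mine, and this is a back-of-the-envelope sketch rather than anything from the Jellyfish review itself.

```python
def relative_cost_per_pr(throughput_multiple: float, cost_multiple: float) -> float:
    """How many times more each PR costs, given throughput and spend multiples."""
    if throughput_multiple <= 0:
        raise ValueError("throughput_multiple must be positive")
    return cost_multiple / throughput_multiple


# Roughly twice the throughput at ten times the token cost.
print(relative_cost_per_pr(throughput_multiple=2.0, cost_multiple=10.0))
```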

Why DORA added a fifth metric

Deployment frequency and lead time for changes used to be the gold standard for engineering performance. They still matter, but getDX points out that both become misleading once AI is writing 30-70% of committed code: a team can look faster while quietly accumulating a quality debt that hasn’t surfaced in the metrics yet.

That’s part of why the DORA research group added rework rate as a fifth core metric. It tracks unplanned deployments made specifically to fix a bug that had already shipped, which makes it the clearest available signal that speed came at the cost of stability. According to Faros AI, as AI adoption accelerates, average code review time rises 91%, pull request size grows 154%, and bug rates climb 9%. The 2025 DORA report published the first official rework-rate benchmarks, and only 7.3% of teams reported rates below 2%. Most organizations are carrying far more hidden rework than their sprint reports suggest.
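For teams that have never computed it, rework rate can be sketched as the share of deployments that were unplanned fixes for something already shipped. The version below is a simplified reading of that definition. The Deployment record and its unplanned_fix flag are hypothetical, and how you label a deployment as an unplanned fix (ticket type, hotfix branch, incident link) is a choice you will have to make for your own pipeline.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    deploy_id: str
    unplanned_fix: bool  # True if this deploy exists only to fix an already-shipped bug


def rework_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that were unplanned fixes."""
    if not deployments:
        return 0.0
    fixes = sum(1 for d in deployments if d.unplanned_fix)
    return fixes / len(deployments)


# Illustrative history only, not real data.
history = [
    Deployment("d1", False),
    Deployment("d2", False),
    Deployment("d3", True),
    Deployment("d4", False),
]
print(f"rework rate: {rework_rate(history):.1%}")
```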

None of this means throw DORA out. It means DORA alone can no longer tell you whether a faster deployment cadence reflects a genuinely better process or a quality trade-off you haven’t paid for yet.

The budget wake-up call

A few stories from this year stuck with me. Uber burned through its entire 2026 AI coding budget by April. Microsoft pulled some developers’ Claude Code licenses months after rolling them out. A Priceline engineer described a routine AI coding tool contract renewal that came back 4-5x more expensive than the year before, as reported by TechCrunch. None of those teams could point to a clean productivity number to justify the spend when finance asked.

It’s not all cautionary tales. When measurement is built in from the start, agentic AI earns its keep. According to VentureBeat, an Amazon engineering team shipped its “Add to Delivery” feature two months ahead of schedule, with developers averaging 150 check-ins a week with AI assistance. Field experiments at Microsoft and Accenture logged 12.9-21.8% and 7.5-8.7% more pull requests per week, respectively, without the churn blowout seen elsewhere, because those teams tied AI usage to specific, tracked outcomes instead of open-ended adoption.

A practical metric stack for 2026

Here’s the stack I’m now asking every engineering lead to report against, before we approve another dollar of agentic AI spend:

Keep the core DORA throughput metrics (deployment frequency, lead time, failed deployment recovery time) as your baseline. Don’t discard what already works.
Add rework rate as your instability check. If it’s rising alongside throughput, your speed gains are borrowed, not earned.
Track code churn and refactor share separately from raw commit or PR counts, so acceptance-time numbers can’t hide what gets rewritten weeks later.
Measure token spend against business outcomes, not against lines of code or PR volume: cost per resolved ticket or shipped feature, not cost per token consumed.
Retire story points and sprint velocity as the primary signal for senior engineers whose job has shifted from writing code to orchestrating and reviewing agents.

None of these requires new tooling you don’t already have. Most engineering analytics platforms can surface rework rate and churn today. The gap isn’t data. It’s the discipline to look at it before the renewal invoice arrives.
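Two of the checks in that stack reduce to very little code once the data is exported. The sketch below assumes a hypothetical per-quarter report with fields I made up for illustration. It computes token spend per shipped feature and flags the "borrowed speed" case where deployments and rework rate rise together. Treat the comparison rule as a starting point, not a calibrated threshold.

```python
from dataclasses import dataclass


@dataclass
class QuarterReport:
    """Hypothetical per-team summary; field names are assumptions."""
    deployments: int
    rework_rate: float      # share of deployments that were unplanned fixes
    token_spend_usd: float  # agentic AI token spend for the quarter
    features_shipped: int


def cost_per_feature(report: QuarterReport) -> float:
    """Token spend per shipped feature, rather than per token or per PR."""
    if report.features_shipped == 0:
        return float("inf")
    return report.token_spend_usd / report.features_shipped


def speed_is_borrowed(prev: QuarterReport, curr: QuarterReport) -> bool:
    """True when throughput and rework rate both rose quarter over quarter."""
    return curr.deployments > prev.deployments and curr.rework_rate > prev.rework_rate


# Illustrative numbers only, not real data.
q1 = QuarterReport(deployments=100, rework_rate=0.05, token_spend_usd=10_000.0, features_shipped=10)
q2 = QuarterReport(deployments=150, rework_rate=0.10, token_spend_usd=30_000.0, features_shipped=12)

print(cost_per_feature(q1), cost_per_feature(q2))
print(speed_is_borrowed(q1, q2))
```

In this made-up example, deployments rose by half while the cost of each shipped feature rose two and a half times, which is exactly the pattern a PR-count dashboard would report as a win.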

What I’d tell my own CTO

The teams getting real value out of agentic AI this year aren’t the ones generating the most code. They’re the ones who can tell you, with a number, how much of that code is still in production three sprints later. That’s the question I now ask before every renewal conversation, and it’s the one I’d put in front of your own team this week. What does your dashboard hide?
