The Return of Flawed Productivity Metrics: A Warning for the Gen AI Era
The rapid ascent of Generative AI, viewed simultaneously as a stock market bubble, a technology hype train, and a genuine innovation, is paradoxically driving a ‘backward slide’ in software development measurement. Organizations are increasingly reverting to outdated, Taylorist approaches, fixating on activity and output metrics such as lines of code or AI usage counts, despite decades of evidence demonstrating their limited value and susceptibility to manipulation. This trend disregards critical lessons, particularly Goodhart’s Law (‘When a measure becomes a target, it ceases to be a good measure’): once such metrics become targets, behavior shifts to game them, undermining genuine progress.
Vendor reports, exemplified by a recent DX Q4 2025 AI Assisted Engineering report, frequently promote metrics such as ‘90% industry-wide AI adoption,’ ‘developers save 3.6 hours per week,’ ‘22% of code is AI authored,’ and ‘daily AI users ship 60% more pull requests.’ However, these are predominantly weak activity or low-value output metrics: often self-reported, lacking statistical rigor, and failing to establish a causal link to system-level outcomes or business success.

In stark contrast, the statistically validated DORA (DevOps Research and Assessment) metrics—deployment frequency, lead time for changes, mean time to restore, and change failure rate—focus on team-level sociotechnical behaviors and business results. The recommended approach emphasizes measuring these established outcome metrics, laddering up to Time to Value and Total Cost of Ownership, and ultimately to contextual business outcomes. While AI usage metrics can provide correlational insights into capability improvements (e.g., TDD adoption), they must be clearly distinguished from true indicators of technology performance and business impact.
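To make the contrast concrete, the four DORA metrics can be derived from ordinary deployment and incident records rather than self-reported surveys. The sketch below is a minimal illustration, assuming a hypothetical list of deployment records; the field names and data are invented for the example, not taken from any specific tool:

```python
from datetime import datetime, timedelta

# Hypothetical deployment records; field names and values are illustrative.
deployments = [
    {"committed": datetime(2025, 1, 1, 9), "deployed": datetime(2025, 1, 1, 17),
     "failed": False},
    {"committed": datetime(2025, 1, 2, 10), "deployed": datetime(2025, 1, 3, 12),
     "failed": True, "restored": datetime(2025, 1, 3, 14)},
    {"committed": datetime(2025, 1, 5, 8), "deployed": datetime(2025, 1, 5, 11),
     "failed": False},
]
observation_days = 7  # length of the measurement window

# Deployment frequency: deployments per day over the window.
deployment_frequency = len(deployments) / observation_days

# Lead time for changes: mean commit-to-deploy duration.
lead_times = [d["deployed"] - d["committed"] for d in deployments]
mean_lead_time = sum(lead_times, timedelta()) / len(lead_times)

# Change failure rate: share of deployments that caused a failure in production.
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)

# Mean time to restore: average time from failed deploy to restored service.
restore_times = [d["restored"] - d["deployed"] for d in failures]
mean_time_to_restore = sum(restore_times, timedelta()) / len(restore_times)

print(deployment_frequency, mean_lead_time,
      change_failure_rate, mean_time_to_restore)
```

The point of the sketch is that each metric is an observable, team-level behavior over a time window, which is what makes them harder to game than counts of AI invocations or lines generated.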