The Six Stages of AI Pipeline Maturity

The Pattern

Every team follows the same trajectory

The specifics vary — different languages, different stacks, different org structures — but the pattern is remarkably consistent. Teams move through recognisable stages, each one representing a concrete shift in what is automated, what is verified, and what the developer's role actually is.

Most teams plateau somewhere between Stage 2 and Stage 3. Getting past that plateau is not a matter of trying harder or prompting more cleverly. It requires a different design.

This article covers Stages 1–6. Stages 7–9 describe an expanding sphere of awareness — the system knowing itself, the organisation it serves, and the world around it — and are covered in the companion article.

Developer governing multiple concurrent AI pipelines

At Stage 5, the developer governs pipelines rather than writing code

~90% of teams stuck at Stage 2–3
4–6× output at Stage 5 vs Stage 3
24/7 pipeline runtime at Stage 5

Stages 1–3

The developer and the assistant

In the first three stages, AI amplifies individual output — but the channel is narrow. One developer, one task, one conversation at a time. The ceiling is absolute.

Stage 1

Autocomplete

A faster keyboard

The tool is a faster keyboard, not a different workflow. The developer writes code; the AI completes lines and functions based on context. Nothing is automated. Full cognitive load remains with the developer.

Line and function completion
Dev reviews every suggestion
Automated: nothing

Stage 2

Prompted Changes

Writing time → review time

The developer describes what needs changing — the AI rewrites functions, files, and components on instruction. Output volume grows. Review burden grows with it. For small changes the trade is favourable; for large rewrites, review cost dominates.

File and component rewrites
Output volume scales fast
Review burden scales too

Stage 3

Collaborative Loop

Conversational back-and-forth

The workflow is conversational. The developer prompts with a brief, reviews a PRD, approves an approach, then tests and debugs with AI assistance. The productivity gain is real but not structural. One developer, one conversation, one task.

Most teams here

The ceiling

The ceiling at Stage 3 is attention bandwidth — absolute and unbreakable. No amount of better prompting raises it. The transition to Stage 4 requires building something, not using something. That distinction is what teams consistently underestimate.

The Fork

The intermediate step most teams skip

There is a decision point between Stage 3 and Stage 4 that does not announce itself as such. Teams that miss it don't know they've missed it until they're six months into pipeline work wondering why verification keeps failing.

The entire orientation of current AI tooling (Cursor, Copilot, Windsurf, every “vibe coding” workflow) operates on a PRD-to-code loop. Andrej Karpathy coined the term “vibe coding” in February 2025: fully giving in to the vibes, describing what you want, watching the AI generate it. Fast, fluid, and genuinely impressive on first contact.

“The distinction that matters is between intent and behaviour.”

A PRD is intent — it tells you what to build. A specification is behaviour — inputs, outputs, edge cases, error conditions, invariants, acceptance criteria. Without a behavioural contract as source material, test generation isn’t deriving tests from requirements. It is asking the model to invent requirements.

The PRD-to-code loop is not a shortcut to the spec-driven loop. It is a different destination — one that produces working prototypes faster and production-quality systems slower.

The fork — PRD to code vs PRD to spec to verified code

The fork that determines whether everything downstream means anything

× The PRD Loop

PRD (intent document)

↓ LLM generates code from intent

Code

↓ LLM writes tests from code

Tests that mirror implementation

↓ Tests pass at 100% coverage

✓ “Green” — not verified

TestGen-LLM: 1 useful test per 20 generated. A green-light generator, not a verification pipeline.

✓ The Spec-Driven Loop

PRD (intent document)

↓ Spec: inputs, outputs, edge cases, invariants

Specification (behaviour contract)

↓ Tests derived from spec, reviewed adversarially

Behavioural tests

↓ Code must satisfy tests — target is clear

✓ Verified behaviour

The spec is the source of truth every downstream stage refers to — test generation, code review, security gate.
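As a rough illustration of the spec-driven loop, the sketch below treats a specification as data and derives checks from it before any implementation exists. The `Case` and `Spec` types, the `derive_tests` helper and the `parse_amount` example are hypothetical names introduced here; they do not describe any particular tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Case:
    """One behavioural example: an input and what must happen."""
    name: str
    args: tuple
    expected: Any = None             # required return value when the call should succeed
    raises: Optional[type] = None    # required exception type when the call should fail


@dataclass(frozen=True)
class Spec:
    """A behavioural contract, written before any implementation exists."""
    function: str
    cases: tuple


def derive_tests(spec: Spec) -> list:
    """Turn each spec case into a check that any candidate implementation must pass."""
    def make(case: Case) -> Callable[[Callable], None]:
        def check(impl: Callable) -> None:
            if case.raises is not None:
                try:
                    impl(*case.args)
                except case.raises:
                    return
                raise AssertionError(f"{case.name}: expected {case.raises.__name__}")
            result = impl(*case.args)
            if result != case.expected:
                raise AssertionError(f"{case.name}: got {result!r}")
        return check
    return [make(case) for case in spec.cases]


# Hypothetical spec for a function that parses a decimal string into cents.
PARSE_AMOUNT_SPEC = Spec(
    function="parse_amount",
    cases=(
        Case("simple value", ("12.50",), expected=1250),
        Case("empty input is rejected", ("",), raises=ValueError),
        Case("negative input is rejected", ("-1",), raises=ValueError),
    ),
)
```

The checks know nothing about how `parse_amount` is written, so they cannot mirror its implementation. A code-generation step is then given a fixed target: make every derived check pass.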

Spec discipline — the gate between intent and production quality

Why it matters at scale

The review posture that works at Stage 3 becomes a fiction at Stage 5

Teams say: we review the code carefully; we have experienced developers. That is true at Stage 3, where one developer reviews one task. It is not architecturally possible at Stage 5, where one developer governs six concurrent automated pipelines.

The specification is the mechanism by which the behavioural contract survives parallelisation. It exists independently of any single developer’s knowledge of the task. It can be reviewed adversarially by a model that never wrote the code.

GitHub Engineering, September 2025

“The specification becomes the source of truth.” Thoughtworks named spec-driven development one of their key new AI-assisted engineering practices for 2025.

Georgetown CSET

45% of AI-generated code contains exploitable security vulnerabilities — absent spec discipline and adversarial review gates.

50–200× cost to fix a defect found in production vs the requirements phase (Boehm, 50 years of software research)
2.74× more security vulnerabilities in AI code without spec discipline (CodeRabbit, December 2025)
16 of 18 CTOs reported production disasters directly traceable to AI-generated code (Final Round AI, August 2025)

Stage 4

Partial Pipeline

Requires building something

Fewer teams reach this stage — and the ones that do generally discover they cannot get here by prompting harder.

The shift is structural. Spec writing, test generation, and code generation begin running without continuous human oversight. The developer reviews PRDs, monitors pipeline progress, and signs off at key decision points.

“A Stage 4 workflow is not a better Stage 3 workflow. It is a different artefact: a designed process, not a conversation.”

Spec writing automated
Test generation automated (TDD)
Code generation automated
Human signs off at gates — not every step
Partial debug loop automated

Stage 4 — partial pipeline with gate sign-offs

Key realisation

Teams that realise early that Stage 4 requires building something spend their time on architecture. Teams that realise it late spend their time retrofitting.

Stage 5

Full Pipeline, Fixed Configuration

Governing pipelines, not writing code

The developer's role has shifted entirely

At Stage 3 — writing code

Writing and reviewing code line by line
One task at a time, sequentially
Context-switching constantly
Deeply involved in every decision
Review becomes the bottleneck

At Stage 5 — governing pipelines

Defining tasks and reviewing PRDs
2–6 concurrent pipeline tasks
Setting standards and scoring rubrics
Configuring model routing and agent teams
Monitoring gates, unblocking stalls

The automated sequence per task

Spec Write → Spec Review → ■ Gate → Write Tests → Test Review → ■ Gate → Code Gen → Code Review → ■ Quality → Security → Deploy ✓

Failed gates → automated remediation → max retries exceeded → escalate to human

4–6 tasks concurrently — parallelism compounds in ways individual bandwidth cannot

Pipeline ships overnight while team sleeps

Agents don’t have standups. Code is ready for review by morning.

Test gaming and circular test patterns at Stage 5

The constraint that defines Stage 5

Everything runs at one speed

A proof-of-concept spike, a piece of internal tooling, and a compliance-critical financial system all go through the same number of gates, the same review depth, the same model selection. For a hackathon exploration, that is an expensive way to run a spike. This is the ceiling that Stage 6 removes.

⚠ What breaks in practice at Stage 5

🎯

Reward hacking & test gaming

Language models optimise for passing tests, not correct behaviour. METR’s 2025 research documented monkey-patching timing functions, scavenging pre-computed results, and special-casing test inputs. GPT-5 exploited test cases in 76% of runs without explicit prompt constraints.

Fix: Restrict model access to test files during code generation. Review tests adversarially before code generation runs.
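One way to read "restrict model access to test files" is to filter the file list before it is ever handed to the code-generation step. The sketch below is a minimal version of that idea; the directory names, filename patterns and the `codegen_view` helper are assumptions and would need to match the repository's own conventions.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Assumed naming conventions for test files; adjust to the repository's own.
TEST_DIR_NAMES = {"tests", "test"}
TEST_FILE_PATTERNS = ("test_*.py", "*_test.py", "conftest.py")


def is_test_path(path: str) -> bool:
    """True if the path lies in a test directory or matches a test file pattern."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    if any(part in TEST_DIR_NAMES for part in parts[:-1]):
        return True
    return any(fnmatchcase(parts[-1], pattern) for pattern in TEST_FILE_PATTERNS)


def codegen_view(paths: list) -> list:
    """Return only the paths a code-generation step is allowed to read."""
    return [p for p in paths if not is_test_path(p)]
```

Filtering the listing is only the first layer. If the generating agent can still open arbitrary paths through its tools, the same predicate has to be enforced there as well.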

🔄

Circular tests

The model writes tests that mirror the implementation rather than validate behaviour. Coverage metrics read healthy. Behavioural validation is zero. TestGen-LLM: 1 meaningful assertion per 20 test cases. A pipeline built on circular tests is not a verification pipeline.

Fix: The gate between test generation and code generation is structurally load-bearing — not optional.
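A small, contrived example shows why a circular test reads as healthy while validating nothing. The `apply_discount` function is deliberately buggy, and the rounding rule quoted in the second test is a hypothetical spec line invented for this illustration.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Buggy on purpose: floor division truncates the discount instead of rounding it."""
    return price_cents - (price_cents * percent) // 100


def circular_test() -> bool:
    # Mirrors the implementation: the expected value is recomputed with the
    # same formula, so the test can only confirm that the code equals itself.
    price, percent = 999, 15
    expected = price - (price * percent) // 100
    return apply_discount(price, percent) == expected


def behavioural_test() -> bool:
    # Derived from a hypothetical spec line: "the discount is rounded to the
    # nearest cent", so 15% of 9.99 is 1.50 and the discounted price is 8.49.
    return apply_discount(999, 15) == 849
```

Both tests execute the same line of code, so coverage is identical. Only the second one can fail, and for this implementation it does.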

📏

Bare minimum compliance

The model writes the minimum code required to produce a green test run. Edge cases not covered by tests do not exist to the model. Remove the specification and the chain collapses: the model completes the task only in the narrowest possible sense.

Fix: The spec defines what “complete” means. Without it there is no source of truth for completeness.

⏰

Stall detection failures

Remediation loops without explicit iteration caps will retry the same failed operation indefinitely. Most teams discover this after watching a pipeline spin for a day on a problem resolvable in thirty minutes with a human in the loop.

Fix: Designing the escalation architecture is as important as designing the generation architecture.
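A remediation loop with an explicit cap might look like the sketch below. It also escalates early when the same failure repeats, since retrying an identical failure rarely helps. The function name, the `Escalate` exception, the fingerprinting approach and the default limits are all assumptions made for illustration.

```python
import hashlib
from typing import Callable, Optional


class Escalate(Exception):
    """Raised when automated remediation should stop and a human should look."""


def _fingerprint(failure_output: str) -> str:
    # Collapse whitespace so trivially different logs compare as equal.
    normalised = " ".join(failure_output.split()).lower()
    return hashlib.sha256(normalised.encode()).hexdigest()


def remediate_until_green(
    run_check: Callable[[], Optional[str]],   # returns None on success, else failure output
    remediate: Callable[[str], None],
    max_iterations: int = 5,
    max_identical: int = 2,
) -> int:
    """Return the number of remediation attempts used, or raise Escalate."""
    last_fp, identical = None, 0
    for iteration in range(max_iterations + 1):
        failure = run_check()
        if failure is None:
            return iteration
        if iteration == max_iterations:
            break  # budget spent: do not remediate again
        fp = _fingerprint(failure)
        identical = identical + 1 if fp == last_fp else 1
        last_fp = fp
        if identical >= max_identical:
            raise Escalate(f"same failure seen {identical} times in a row")
        remediate(failure)
    raise Escalate(f"still failing after {max_iterations} remediation attempts")
```

The cap bounds the worst case. The repeat check is what stops a pipeline from spending its whole budget on a problem it has already shown it cannot fix.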

Stage 6

Spec-Driven Pipeline Synthesis

The pipeline itself becomes an artefact derived from requirements

Stage 5 runs a fixed pipeline. Stage 6 builds the pipeline from the specification.

Given a specification or intent, the system determines what artefacts are required, what production workflow is needed, and generates the complete pipeline: phases, agents, validators, gates, retry logic, and escalation rules. The system constructs the factory, not just the product.

The observation that emerges consistently: test writing dominates autonomous pipeline work — often approaching half of all pipeline activity. This is counterintuitive. Implementation converges quickly once tests precisely define the required behaviour. The hard work is producing tests that genuinely reflect the behavioural contract.

Structural validators — lint, static analysis, schema validation
Operational validators — performance benchmarks, deployability, resilience
Policy validators — security scanning, compliance, accessibility
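To make "the pipeline is derived from the specification" slightly more concrete, here is a toy sketch in which a few facts read from a spec decide which phases and validators are included. The `SpecTraits` fields, phase names and validator names are all illustrative assumptions; a real synthesis step would work from far richer input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecTraits:
    """Facts read from a specification (all field names are illustrative)."""
    exposes_api: bool = False
    has_latency_target: bool = False
    handles_regulated_data: bool = False


@dataclass(frozen=True)
class PipelinePlan:
    phases: tuple
    validators: dict   # validator names grouped by category


def synthesise(traits: SpecTraits) -> PipelinePlan:
    """Build a pipeline plan from spec traits instead of using one fixed sequence."""
    phases = ["spec_write", "spec_review", "write_tests",
              "test_review", "code_gen", "code_review"]
    validators = {
        "structural": ["lint", "static_analysis"],
        "operational": [],
        "policy": [],
    }
    if traits.exposes_api:
        validators["structural"].append("schema_validation")
    if traits.has_latency_target:
        validators["operational"].append("performance_benchmark")
    if traits.handles_regulated_data:
        validators["policy"] += ["security_scan", "compliance_check"]
        phases.append("security_review")
    phases.append("deploy")
    return PipelinePlan(tuple(phases), validators)
```

A spike with no traits set gets the short path; a spec that touches regulated data picks up an extra review phase and policy validators without anyone editing the pipeline by hand.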

Stage 6 — configurable pipeline synthesis from specifications

TDD at Stage 6

The test review gate between test generation and code generation is structurally load-bearing. It’s where circular tests, test-scope gaps, and interface mismatches are caught before they propagate into code generation.

Configurable rigour profiles

Pipeline behaviour at Stage 6 is controlled by named profiles — a configuration file, not a code change. A multi-spec feature slice that takes a day at Stage 5 can complete in hours at Stage 6 using an appropriate profile.

Profile	Use Case	Relative Cost	Wall-clock	Tickets
poc	Hackathon, spike, exploration	Minimal	2–5 min	None
mvp	Demo, prototype, first working version	Low	5–10 min	None
team	Internal tooling, trusted team	Moderate	10–20 min	MAJOR+ tracked
org	Multi-team shared codebase	Moderate-high	15–30 min	MAJOR+ tracked, CRITICAL blocks
saas	Customer-facing product	High	20–45 min	All tracked
mission-critical	Financial, compliance, security	Full	30–180 min	All block

New profiles require a configuration file, not code changes. Profile validation at startup catches missing values before they silently weaken a strict profile.
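Startup validation of this kind can be sketched in a few lines. The key names (`review_depth`, `max_retries`, `blocks_on`) and all values below are hypothetical; the point is only that a missing setting raises an error instead of falling back to a lenient default.

```python
REQUIRED_KEYS = frozenset({"review_depth", "max_retries", "blocks_on"})
ALLOWED_BLOCKS_ON = frozenset({"none", "critical", "all"})


def validate_profile(name: str, profile: dict) -> dict:
    """Fail fast on incomplete profiles instead of silently applying defaults."""
    missing = sorted(REQUIRED_KEYS.difference(profile))
    if missing:
        raise ValueError(f"profile {name!r} is missing: {', '.join(missing)}")
    if profile["blocks_on"] not in ALLOWED_BLOCKS_ON:
        raise ValueError(f"profile {name!r} has unknown blocks_on: {profile['blocks_on']!r}")
    return profile


def load_profiles(raw: dict) -> dict:
    """Validate every profile once, at startup, before any pipeline runs."""
    return {name: validate_profile(name, profile) for name, profile in raw.items()}


# Illustrative values only; in practice these would be read from a config file.
EXAMPLE_PROFILES = {
    "poc": {"review_depth": 0, "max_retries": 1, "blocks_on": "none"},
    "mission-critical": {"review_depth": 3, "max_retries": 5, "blocks_on": "all"},
}
```

The failure mode this guards against is a lookup such as `profile.get("blocks_on", "none")`: a strict profile with one key left out would quietly behave like the most permissive one.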

Two design decisions that define Stage 6

The pipeline itself — phase sequencing, state management, stall detection, score parsing, retry logic — runs as deterministic scripts with no LLM in the control loop. And each pipeline role gets a dedicated agent with strict tool access boundaries.

Decision 01

Deterministic orchestration

Phase sequencing, state management, stall detection, score parsing, and retry logic run as deterministic scripts — no LLM in the control loop. LLMs are invoked inside the pipeline as reviewers, builders, and remediators. The pipeline’s own decisions are computed from state files and score diffs.

LLM orchestrators routinely over-manage: stopping and restarting workers mid-task, adding arbitrary timeouts, killing slow processes. Deterministic orchestration is faster, cheaper, and reproducible. When the pipeline produces an unexpected outcome, root-cause analysis is tractable.
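A decision "computed from state files and score diffs" can be as small as a pure function over a JSON document. The state layout (`scores`, `threshold`, `max_retries`), the action names and the rule that a non-improving score escalates are assumptions made for this sketch.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    action: str   # "run", "advance", "remediate" or "escalate"
    reason: str


def decide(state_json: str) -> Decision:
    """Pure function of the state file: the same state always yields the same decision."""
    state = json.loads(state_json)
    scores = state["scores"]          # review scores per attempt, oldest first
    if not scores:
        return Decision("run", "no attempts recorded yet")
    latest = scores[-1]
    if latest >= state["threshold"]:
        return Decision("advance", "latest score meets the threshold")
    if len(scores) - 1 >= state["max_retries"]:
        return Decision("escalate", "retry budget exhausted")
    if len(scores) >= 2 and latest - scores[-2] <= 0:
        return Decision("escalate", "score did not improve after remediation")
    return Decision("remediate", "below threshold with retries remaining")
```

Because every decision carries a reason and depends only on the recorded state, an unexpected outcome can be replayed from the state file alone, which is what makes root-cause analysis tractable.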

Decision 02

Specialised agents with constrained permissions

Each pipeline role gets a dedicated agent definition with strict tool access boundaries. A builder agent cannot stop the pipeline. A reviewer cannot modify infrastructure. A test writer cannot skip the review gate.

Over-management becomes structurally impossible — not because agents are told not to over-manage, but because they do not have the tools to do it. This is the critical distinction: a behavioural constraint can be violated; an architectural constraint cannot.
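The difference between a behavioural and an architectural constraint can be sketched with a per-role tool allowlist: an agent is only ever handed the callables its role permits. The role names, tool names and the `Agent` class are hypothetical.

```python
# Hypothetical tool registry; each tool is a plain callable.
REGISTRY = {
    "read_file": lambda path: f"<contents of {path}>",
    "write_file": lambda path, text: len(text),
    "post_review": lambda text: "review posted",
    "stop_pipeline": lambda: "pipeline stopped",
}

# Hypothetical role definitions: the allowlist is part of the agent's definition.
ROLE_TOOLS = {
    "builder": frozenset({"read_file", "write_file"}),
    "reviewer": frozenset({"read_file", "post_review"}),
    "operator": frozenset({"stop_pipeline"}),
}


class Agent:
    def __init__(self, role: str, registry: dict):
        allowed = ROLE_TOOLS[role]
        self.role = role
        # Tools outside the allowlist are never passed in, so no prompt or
        # instruction can make the agent invoke them.
        self._tools = {name: fn for name, fn in registry.items() if name in allowed}

    def call(self, tool: str, *args):
        if tool not in self._tools:
            raise PermissionError(f"{self.role} has no tool named {tool!r}")
        return self._tools[tool](*args)
```

In a single Python process this boundary is a convention, not a sandbox. For the constraint to be architectural in practice, the allowlist would need to be enforced where tools are actually executed, for example in a separate process or service.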