Most teams plateau at Stage 3. The ones that don't designed their way past it.
A fully transparent, autonomous AI development pipeline — every stage visible, every output verified
The specifics vary — different languages, different stacks, different org structures — but the pattern is remarkably consistent. Teams move through recognisable stages, each one representing a concrete shift in what is automated, what is verified, and what the developer's role actually is.
Most teams plateau somewhere between Stage 2 and Stage 3. Getting past that plateau is not a matter of trying harder or prompting more cleverly. It requires a different design.
This article covers Stages 1–6. Stages 7–9 describe an expanding sphere of awareness — the system knowing itself, the organisation it serves, and the world around it — and are covered in the companion article.
At Stage 5, the developer governs pipelines rather than writing code
In the first three stages, AI amplifies individual output — but the channel is narrow. One developer, one task, one conversation at a time. The ceiling is absolute.
A faster keyboard
The tool is a faster keyboard, not a different workflow. The developer writes code; the AI completes lines and functions based on context. Nothing is automated. Full cognitive load remains with the developer.
Writing time → review time
The developer describes what needs changing — the AI rewrites functions, files, and components on instruction. Output volume grows. Review burden grows with it. For small changes the trade is favourable; for large rewrites, review cost dominates.
Conversational back-and-forth
The workflow is conversational. The developer prompts with a brief, reviews a PRD, approves an approach, then tests and debugs with AI assistance. The productivity gain is real, but it is not structural. One developer, one conversation, one task.
Most teams here
The ceiling at Stage 3 is attention bandwidth, absolute and unbreakable. No amount of better prompting raises it. The transition to Stage 4 requires building something, not using something. That distinction is what teams consistently underestimate.
There is a decision point between Stage 3 and Stage 4 that does not announce itself as such. Teams that miss it don't know they've missed it until they're six months into pipeline work wondering why verification keeps failing.
The entire orientation of current AI tooling (Cursor, Copilot, Windsurf, every “vibe coding” workflow) operates on a PRD-to-code loop. Andrej Karpathy coined the term “vibe coding” in February 2025: fully giving in to the vibes, describing what you want, watching the AI generate it. Fast, fluid, and genuinely impressive on first contact.
“The distinction that matters is between intent and behaviour.”
A PRD is intent — it tells you what to build. A specification is behaviour — inputs, outputs, edge cases, error conditions, invariants, acceptance criteria. Without a behavioural contract as source material, test generation isn’t deriving tests from requirements. It is asking the model to invent requirements.
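One way to make the intent/behaviour distinction concrete is to treat the specification as structured data with a slot for each element listed above. The sketch below is illustrative only: `BehaviouralSpec`, `Example`, and `missing_sections` are hypothetical names, not part of any tool mentioned in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One concrete input/output pair taken from the specification."""
    description: str
    inputs: dict
    expected: object


@dataclass(frozen=True)
class BehaviouralSpec:
    """Hypothetical container for the contract a PRD leaves implicit."""
    name: str
    examples: tuple[Example, ...] = ()
    edge_cases: tuple[Example, ...] = ()
    error_conditions: tuple[str, ...] = ()
    invariants: tuple[str, ...] = ()

    def missing_sections(self) -> list[str]:
        """Name the empty sections: what a test generator would have to invent."""
        sections = {
            "examples": self.examples,
            "edge_cases": self.edge_cases,
            "error_conditions": self.error_conditions,
            "invariants": self.invariants,
        }
        return [name for name, value in sections.items() if not value]


# A PRD-only starting point: a name and nothing else.
prd_only = BehaviouralSpec(name="apply_discount")
print(prd_only.missing_sections())
```

Every section reported as missing is a requirement the model would be making up during test generation.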
The PRD-to-code loop is not a shortcut to the spec-driven loop. It is a different destination — one that produces working prototypes faster and production-quality systems slower.
The fork that determines whether everything downstream means anything
Teams say: we review the code carefully; we have experienced developers. That is true at Stage 3, where one developer reviews one task. It is not architecturally possible at Stage 5, where one developer governs six concurrent automated pipelines.
The specification is the mechanism by which the behavioural contract survives parallelisation. It exists independently of any single developer’s knowledge of the task. It can be reviewed adversarially by a model that never wrote the code.
“The specification becomes the source of truth.” Thoughtworks named spec-driven development one of their key new AI-assisted engineering practices for 2025.
45% of AI-generated code contains exploitable security vulnerabilities — absent spec discipline and adversarial review gates.
Fewer teams reach this stage — and the ones that do generally discover they cannot get here by prompting harder.
The shift is structural. Spec writing, test generation, and code generation begin running without continuous human oversight. The developer reviews PRDs, monitors pipeline progress, and signs off at key decision points.
“A Stage 4 workflow is not a better Stage 3 workflow. It is a different artefact: a designed process, not a conversation.”
Teams that realise Stage 4 requires building something early spend their time on architecture. Teams that realise it late spend their time on retrofitting.
Governing pipelines, not writing code
Failed gates → automated remediation → max retries exceeded → escalate to human
4–6 tasks concurrently — parallelism compounds in ways individual bandwidth cannot
Agents don’t have standups. Code is ready for review by morning.
At Stage 5, a proof-of-concept spike, a piece of internal tooling, and a compliance-critical financial system all go through the same number of gates, the same review depth, and the same model selection. For a hackathon exploration, that makes for an expensive spike. This is the ceiling that Stage 6 removes.
Language models optimise for passing tests, not correct behaviour. METR’s 2025 research documented monkey-patching timing functions, scavenging pre-computed results, special-casing test inputs. GPT-5 exploited test cases in 76% of runs without explicit prompt constraints.
The model writes tests that mirror the implementation rather than validate behaviour. Coverage metrics read healthy. Behavioural validation is zero. TestGen-LLM: 1 meaningful assertion per 20 test cases. A pipeline built on circular tests is not a verification pipeline.
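A small example shows how a circular test differs from a behavioural one. Everything here is hypothetical: `apply_discount` is an invented function with a deliberate bug, and the "never negative" rule stands in for a clause from an assumed specification.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test. Deliberate bug: no clamping above 100%."""
    return price - price * percent / 100


def circular_test() -> bool:
    """Mirrors the implementation: recomputes the same formula, so it cannot fail."""
    price, percent = 80.0, 150.0
    expected = price - price * percent / 100
    return apply_discount(price, percent) == expected


def behavioural_test() -> bool:
    """Checks a rule from the assumed specification: the result is never negative."""
    return apply_discount(80.0, 150.0) >= 0


print(circular_test())     # True: the test passes and the line is covered
print(behavioural_test())  # False: the bug is only visible against the spec
```

Both tests execute the same line of code, so a coverage report cannot tell them apart. Only the second one can fail.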
The model writes the minimum code required to produce a green test run. Edge cases not covered by tests do not exist to the model. Remove the specification and the chain collapses — the model completes in the narrowest possible sense.
Remediation loops without explicit iteration caps will retry the same failed operation indefinitely. Most teams discover this after watching a pipeline spin for a day on a problem resolvable in thirty minutes with a human in the loop.
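An explicit iteration cap can be very small. The following is a minimal sketch of a capped remediation loop, assuming the gate and the remediation step are plain callables. The function name, the default of three retries, and the return strings are illustrative choices, not taken from a specific pipeline.

```python
from typing import Callable


def run_gate_with_remediation(
    gate: Callable[[], bool],
    remediate: Callable[[int], None],
    max_retries: int = 3,
) -> str:
    """Run a gate, remediating on failure, and stop after max_retries remediations.

    Returns "passed" if the gate eventually passes, otherwise "escalated".
    """
    for attempt in range(max_retries + 1):
        if gate():
            return "passed"
        if attempt < max_retries:
            remediate(attempt)  # automated fix attempt, numbered from 0
    # Cap reached: hand the problem to a human instead of retrying again.
    return "escalated"
```

With a gate that never passes and `max_retries=3`, the gate runs four times, remediation runs three times, and the function returns `"escalated"` rather than spinning.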
The pipeline itself becomes an artefact derived from requirements
Stage 5 runs a fixed pipeline. Stage 6 builds the pipeline from the specification.
Given a specification or intent, the system determines what artefacts are required, what production workflow is needed, and generates the complete pipeline: phases, agents, validators, gates, retry logic, and escalation rules. The system constructs the factory, not just the product.
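As a rough illustration of deriving a pipeline from a specification, the sketch below maps a few properties of a spec to a list of phases. All of it is assumed for the example: the `Phase` fields, the spec keys `exploratory` and `handles_payments`, and the rules themselves. A real system would derive far more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    agent: str        # role that performs the phase
    gate: str         # validator that must pass before the next phase
    max_retries: int  # remediation attempts before escalating to a human


def build_pipeline(spec: dict) -> list[Phase]:
    """Derive phases from properties of the spec (all rules here are illustrative)."""
    exploratory = bool(spec.get("exploratory"))
    retries = 1 if exploratory else 3
    phases = [
        Phase("test_generation", "test_writer", "test_review", retries),
        Phase("code_generation", "builder", "tests_pass", retries),
    ]
    if not exploratory:
        # Non-exploratory work gets an upfront review of the spec itself.
        phases.insert(0, Phase("spec_review", "reviewer", "spec_complete", retries))
    if spec.get("handles_payments"):
        phases.append(Phase("security_review", "reviewer", "no_critical_findings", retries))
    return phases


print([p.name for p in build_pipeline({"exploratory": True})])
print([p.name for p in build_pipeline({"handles_payments": True})])
```

The point of the sketch is the direction of dependency: the phase list is an output computed from the spec, not a fixed input every task is forced through.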
The observation that emerges consistently: test writing dominates autonomous pipeline work — often approaching half of all pipeline activity. This is counterintuitive. Implementation converges quickly once tests precisely define the required behaviour. The hard work is producing tests that genuinely reflect the behavioural contract.
The test review gate between test generation and code generation is structurally load-bearing. It’s where circular tests, test-scope gaps, and interface mismatches are caught before they propagate into code generation.
Pipeline behaviour at Stage 6 is controlled by named profiles — a configuration file, not a code change. A multi-spec feature slice that takes a day at Stage 5 can complete in hours at Stage 6 using an appropriate profile.
| Profile | Use Case | Relative Cost | Wall-clock | Tickets |
|---|---|---|---|---|
| poc | Hackathon, spike, exploration | Minimal | 2–5 min | None |
| mvp | Demo, prototype, first working version | Low | 5–10 min | None |
| team | Internal tooling, trusted team | Moderate | 10–20 min | MAJOR+ tracked |
| org | Multi-team shared codebase | Moderate-high | 15–30 min | MAJOR+ tracked, CRITICAL blocks |
| saas | Customer-facing product | High | 20–45 min | All tracked |
| mission-critical | Financial, compliance, security | Full | 30–180 min | All block |
New profiles require a configuration file, not code changes. Profile validation at startup catches missing values before they silently weaken a strict profile.
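Startup validation of a profile might look like the sketch below. The required keys (`review_depth`, `max_retries`, `ticket_threshold`, `blocking_severities`) are invented for illustration, since the actual profile schema is not shown here. What matters is that a missing value raises an error instead of falling back to a default.

```python
REQUIRED_KEYS = {"review_depth", "max_retries", "ticket_threshold", "blocking_severities"}


def validate_profile(name: str, profile: dict) -> dict:
    """Fail fast if a profile omits a required key; never fill in defaults."""
    missing = sorted(REQUIRED_KEYS - set(profile))
    if missing:
        raise ValueError(f"profile {name!r} is missing: {', '.join(missing)}")
    return profile


# A profile as it might look after being loaded from its configuration file.
org_profile = {
    "review_depth": 3,
    "max_retries": 2,
    "ticket_threshold": "MAJOR",
    "blocking_severities": ["CRITICAL"],
}
validate_profile("org", org_profile)  # passes silently

try:
    validate_profile("mission-critical", {"review_depth": 5})
except ValueError as exc:
    print(exc)
```

Refusing to default is the design choice that matters. If a strict profile silently inherited a lenient `blocking_severities` value, it would look strict in the file name and behave like a weaker profile in practice.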
The pipeline itself — phase sequencing, state management, stall detection, score parsing, retry logic — runs as deterministic scripts with no LLM in the control loop. And each pipeline role gets a dedicated agent with strict tool access boundaries.
Phase sequencing, state management, stall detection, score parsing, and retry logic run as deterministic scripts — no LLM in the control loop. LLMs are invoked inside the pipeline as reviewers, builders, and remediators. The pipeline’s own decisions are computed from state files and score diffs.
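A decision computed from state and score diffs can be written as a pure function. This is a sketch under assumptions: the state keys (`scores`, `pass_score`, `attempts`, `max_retries`), the action names, and the rule that a non-improving score counts as a stall are all illustrative. In practice the dictionary would be loaded from a state file.

```python
def next_action(state: dict) -> str:
    """Pure function of pipeline state: the same state always yields the same decision."""
    scores = state["scores"]  # gate scores from successive review rounds
    if not scores:
        return "run_gate"
    if scores[-1] >= state["pass_score"]:
        return "advance"
    if state["attempts"] >= state["max_retries"]:
        return "escalate"
    # Stall detection via score diff: the latest remediation did not improve the score.
    if len(scores) >= 2 and scores[-1] <= scores[-2]:
        return "escalate"
    return "remediate"


print(next_action({"scores": [4, 6], "pass_score": 8, "attempts": 1, "max_retries": 3}))
print(next_action({"scores": [5, 5], "pass_score": 8, "attempts": 1, "max_retries": 3}))
```

Because no model is consulted, replaying a saved state reproduces the decision exactly.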
LLM orchestrators routinely over-manage: stopping and restarting workers mid-task, adding arbitrary timeouts, killing slow processes. Deterministic orchestration is faster, cheaper, and reproducible. When the pipeline produces an unexpected outcome, root-cause analysis is tractable.
Each pipeline role gets a dedicated agent definition with strict tool access boundaries. A builder agent cannot stop the pipeline. A reviewer cannot modify infrastructure. A test writer cannot skip the review gate.
Over-management becomes structurally impossible — not because agents are told not to over-manage, but because they do not have the tools to do it. This is the critical distinction: a behavioural constraint can be violated; an architectural constraint cannot.
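The difference between a behavioural and an architectural constraint can be shown with a tool allowlist. The role names follow the text, but the tool names and the `tools_for` helper are hypothetical. The idea is that an agent is handed only the callables its role permits, so a forbidden action is not something it can even attempt.

```python
from typing import Callable

# Hypothetical allowlist: which tools each pipeline role may hold.
ROLE_TOOLS = {
    "builder": frozenset({"read_file", "write_file", "run_tests"}),
    "reviewer": frozenset({"read_file", "post_review"}),
    "test_writer": frozenset({"read_file", "write_test"}),
}


def tools_for(role: str, registry: dict[str, Callable]) -> dict[str, Callable]:
    """Return only the callables the role is allowed to hold."""
    allowed = ROLE_TOOLS[role]
    return {name: fn for name, fn in registry.items() if name in allowed}


# Stand-in registry of every tool in the system, including pipeline control.
all_tools = ["read_file", "write_file", "run_tests", "post_review", "write_test", "stop_pipeline"]
registry = {name: (lambda n=name: n) for name in all_tools}

builder_tools = tools_for("builder", registry)
print(sorted(builder_tools))
print("stop_pipeline" in builder_tools)
```

No role in the allowlist includes `stop_pipeline`, so stopping the pipeline is something only the deterministic orchestrator can do.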
Each stage is a concrete capability. Stages 7–9 describe something different: an expanding sphere of awareness — the system knowing itself, the organisation it serves, and the environment that organisation exists in.