4 min read · Updated Feb 18, 2026

How OpenAI Built 1 Million Lines of Code Using Only Agents: 5 Harness Engineering Principles

OpenAI's Codex team built a 1M-line codebase using only AI agents. Here are the five harness engineering principles they discovered along the way.

A harness is the tool shell that allows an AI agent to affect the real world. If the reasoning model is the brain, the harness is the hands and feet: reading files, fixing code, running tests, deploying to production. The quality of that shell determines what the agent can actually accomplish.

An internal OpenAI team started from an empty repository in late August 2025 and built a 1-million-line product using only Codex agents, with no human-written code. They reported it took one-tenth the time it would have taken to build manually. Whether that estimate holds for other teams or other codebases is an open question, but the five principles they extracted are worth examining on their own terms.

What the Agent Can’t See Doesn’t Exist

From Codex’s perspective, information it can’t access at runtime might as well not exist. Planning docs in Google Docs, architecture decisions agreed upon in Slack, tacit knowledge inside someone’s head: none of it is visible. It is the same situation a new hire joining three months from now would face.

The team’s response was to push every decision into the repository as markdown, schemas, and execution plans called ExecPlans. An ExecPlan is a self-contained design document defined in PLANS.md, written to a standard where a beginner could read it and implement the feature end to end. The structure extends matklad’s ARCHITECTURE.md concept for agent use. There were cases where Codex worked continuously for over 7 hours on a single prompt, which only works when the context is complete and stable.

Ask What Capability Is Missing, Not Why the Agent Is Failing

Early in the project, agent velocity was slower than expected. The cause was not model performance; it was an under-equipped environment. Each time something failed, the team asked: “What capability is missing, and how do we make it readable and verifiable by the agent?” That reframe shifted the work from prompting the agent harder toward instrumenting the environment better.

They built custom concurrency helpers rather than reaching for external libraries, which allowed full integration with OpenTelemetry. Stable, well-documented “boring technology” turned out to favor agents: API stability and higher representation in training data make the agent’s behavior more predictable.
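A minimal sketch of what such a helper might look like: a bounded-concurrency `gather` where every task runs inside a span, so each unit of work shows up in traces. The `span` context manager here is a stand-in for OpenTelemetry's `tracer.start_as_current_span()`; the helper name and structure are assumptions, not the team's actual code.

```python
import asyncio
from contextlib import asynccontextmanager

# Stand-in for an OpenTelemetry span; the real helper would call
# tracer.start_as_current_span() so every task appears in traces.
@asynccontextmanager
async def span(name: str, events: list):
    events.append(f"start:{name}")
    try:
        yield
    finally:
        events.append(f"end:{name}")

async def bounded_gather(coros, limit: int, events: list):
    """Run coroutines with a concurrency cap, wrapping each in a span."""
    sem = asyncio.Semaphore(limit)

    async def run(i, coro):
        async with sem, span(f"task-{i}", events):
            return await coro

    return await asyncio.gather(*(run(i, c) for i, c in enumerate(coros)))

async def main():
    events = []
    results = await bounded_gather(
        [asyncio.sleep(0, result=n * n) for n in range(4)], limit=2, events=events
    )
    print(results)  # [0, 1, 4, 9]

asyncio.run(main())
```

Writing the helper in-house keeps its API stable and fully observable, which is exactly the property the team wanted for agent-driven work.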

Mechanical Enforcement Over Documentation

Documentation alone could not keep an agent-generated codebase consistent. The team chose to enforce invariant rules mechanically rather than prescribe implementation details in text. They mandated parsing at data boundaries but left the choice of library to the agent. Architecture was locked into a layered domain structure with dependency directions verified by linters.

Each business domain has the same fixed layers: Providers, then Service, then Runtime, then UI. Types, Config, and Repo sit below them as shared cross-cutting concerns. Custom linters and structural tests fail the build immediately on violation; the linters themselves were written by Codex. The result is a codebase where the agent cannot accidentally violate structural rules, even over long unattended runs.
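A dependency-direction check of this kind is simple to sketch. The layer names below come from the article; the module layout and rule implementation are assumptions, shown only to illustrate the idea of mechanically failing the build on a wrong-direction import.

```python
import ast

# Layer order from lowest to highest; a module may only import from
# layers at or below its own. Module layout is hypothetical.
LAYERS = ["types", "config", "repo", "providers", "service", "runtime", "ui"]
RANK = {name: i for i, name in enumerate(LAYERS)}

def violations(source: str, own_layer: str) -> list[str]:
    """Flag imports that point from own_layer to a higher layer."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        targets = []
        if isinstance(node, ast.Import):
            targets = [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        for mod in targets:
            top = mod.split(".")[0]
            if top in RANK and RANK[top] > RANK[own_layer]:
                bad.append(f"{own_layer} -> {mod}")
    return bad

# A service module importing from ui violates the dependency direction.
print(violations("from ui.views import render", "service"))  # ['service -> ui.views']
print(violations("import repo.users", "service"))            # []
```

Run over every file in CI, a check like this turns the architecture diagram into a gate the agent cannot drift past.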

Give the Agent Eyes

The team connected Chrome DevTools Protocol to the agent runtime, giving Codex access to DOM snapshots, screenshots, and navigation capabilities. A comparison of pre- and post-task snapshots, combined with runtime event observation, lets the agent apply fixes in a loop until everything is clean. Single Codex runs regularly sustained focus on one task for over 6 hours.
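The loop itself can be sketched in a few lines. In the real harness, `snapshot()` would pull a DOM snapshot and console errors over the Chrome DevTools Protocol and `apply_fix()` would be a Codex code edit; both are stubbed here, and all names are assumptions.

```python
# Sketch of the fix-until-clean loop. CDP access and agent edits are
# stubbed so the control flow is visible on its own.

def make_page(errors):
    return {"errors": list(errors)}

def snapshot(page):
    # Stand-in for a CDP DOM snapshot plus console-error capture.
    return list(page["errors"])

def apply_fix(page, error):
    # Stand-in for the agent editing code to resolve one observed error.
    page["errors"].remove(error)

def fix_until_clean(page, max_iters=10):
    for i in range(max_iters):
        errors = snapshot(page)
        if not errors:
            return i  # number of fix iterations needed
        apply_fix(page, errors[0])
    raise RuntimeError("still failing after max_iters")

page = make_page(["hydration mismatch", "404 /logo.svg"])
print(fix_until_clean(page))  # 2
```

The key property is that the exit condition is observed, not assumed: the agent keeps iterating until the page it can actually see is clean.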

Observability tools were attached the same way. A temporary observability stack spins up per git worktree and disappears when the work is done. VictoriaLogs and VictoriaMetrics let the agent query logs and metrics directly, which means prompts like “make the service start in under 800ms” become executable instructions rather than aspirational notes.
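What "executable instruction" means here can be made concrete. VictoriaMetrics exposes a Prometheus-compatible query API (`/api/v1/query`), so a latency budget becomes a query plus an assertion; in this sketch the HTTP call is stubbed and the metric name is hypothetical.

```python
# Turn "make the service start in under 800ms" into a checkable assertion.
# VictoriaMetrics serves a Prometheus-compatible /api/v1/query endpoint;
# the HTTP call is stubbed here, and the metric name is hypothetical.

STARTUP_BUDGET_MS = 800

def query_metric(promql: str) -> float:
    # Real version: GET http://localhost:8428/api/v1/query?query=<promql>
    fake_store = {"service_startup_duration_ms": 742.0}
    return fake_store[promql]

def check_startup_budget() -> bool:
    observed = query_metric("service_startup_duration_ms")
    return observed < STARTUP_BUDGET_MS

print(check_startup_budget())  # True
```

With a check like this in the loop, the agent can keep optimizing until the measured number, not its own estimate, satisfies the budget.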

A Map, Not a Manual

Context management determines agent effectiveness. The team initially tried putting everything into one massive AGENTS.md file; it failed. The principle from matklad’s 2021 ARCHITECTURE.md essay proved relevant here: provide a brief bird’s-eye view of the project structure, including only what rarely changes. What works for onboarding humans works for agents too.

An ARCHITECTURE.md is a code map, not a code atlas. Architectural invariants are often expressed as “something does not exist here,” which is counterintuitive but effective. Stating boundaries explicitly constrains all downstream implementation in a way that long documentation cannot.
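An absence invariant can also be checked mechanically rather than merely stated. One plausible form, with all names hypothetical: a structural test asserting that the database driver appears nowhere outside the repo layer.

```python
import ast

# Structural test for an absence invariant: outside the repo layer, no
# module may import the database driver. All names are hypothetical.

DB_DRIVER = "psycopg"

def imports_of(source: str) -> set[str]:
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(a.name.split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods

def db_driver_outside_repo(modules: dict[str, str]) -> list[str]:
    """Return modules breaking 'the DB driver exists only under repo/'."""
    return [
        name for name, src in modules.items()
        if not name.startswith("repo/") and DB_DRIVER in imports_of(src)
    ]

modules = {
    "repo/users.py": "import psycopg",
    "service/signup.py": "from repo import users",
    "ui/views.py": "import psycopg  # violation",
}
print(db_driver_outside_repo(modules))  # ['ui/views.py']
```

One line in ARCHITECTURE.md ("database access exists only in repo/") plus one test like this constrains every future implementation without a page of prose.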

What Remains Unresolved

Even for the Codex team, some questions have no answer yet. Whether a system built entirely by agents can maintain architectural consistency over years is unknown. How this framework itself needs to change as models improve is also unclear. The 1-million-line number is real, but it represents a single internal project under controlled conditions. Extrapolating from it requires caution.

The shift in focus from writing code to designing the environment in which agents write code is the durable idea here, regardless of how the specific tooling evolves.
