7-Step Pipeline to Verify Code Written by AI Agents
When agents push 3,000 commits a day, humans can't review them all. Here's how to build a machine-verified pipeline that catches what people can't.
Peter, a developer at OpenClaw, sometimes pushes over 3,000 commits in a single day. No human review process scales to that volume. Ryan Carson’s “Code Factory” post lays out a workable answer: instead of reading everything, you build a structure where machines verify the code. The seven steps below come from that design, along with a few additions from the broader tooling ecosystem.
One honest caveat upfront: this pipeline catches a lot, but it doesn’t catch everything. Model-pinning reduces drift; it doesn’t eliminate it. Browser evidence prevents false positives on visual regressions; it still misses interaction bugs that only surface in production. The goal is a system that fails loudly and traceably, not one that claims to be infallible.
Define Merge Rules in a Single JSON File
Write down which paths are high-risk and which checks must pass, all in one file. The key insight is that this keeps documentation and scripts in sync. When the rules live in separate places, they drift.
- High-risk paths require a Review Agent plus browser-based evidence
- Low-risk paths can merge after passing a policy gate and CI alone
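As a concrete illustration, the single-file idea can be sketched like this. The file layout and check names below are assumptions for the example, not the actual schema from Carson's post:

```python
import fnmatch
import json

# Hypothetical merge-rules file: risk tiers and their required checks live together.
RULES_JSON = """
{
  "high_risk_paths": ["src/payments/**", "infra/**"],
  "checks": {
    "high": ["policy-gate", "review-agent", "browser-evidence", "ci"],
    "low":  ["policy-gate", "ci"]
  }
}
"""

def required_checks(changed_paths, rules):
    """Return the checks a PR must pass, based on the riskiest path it touches."""
    high = any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_paths
        for pattern in rules["high_risk_paths"]
    )
    return rules["checks"]["high" if high else "low"]

rules = json.loads(RULES_JSON)
print(required_checks(["src/payments/refund.ts"], rules))
print(required_checks(["docs/intro.md"], rules))
```

Because both the documentation generator and the gate scripts read the same JSON, there is no second copy of the rules to drift out of date.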
Run Qualification Checks Before CI
Running builds on PRs that haven’t even passed review burns money. Putting a risk-policy gate in front of CI fanout cuts that waste significantly.
- Fixed order: policy gate → Review Agent confirmation → CI fanout
- Unqualified PRs never enter the test/build stage
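The fixed ordering can be modeled as a fail-fast chain, sketched below with invented stage names (the real pipeline runs these as separate GitHub Actions workflows):

```python
# Each stage runs only if the previous one passed, so CI fanout
# (the expensive part) never starts for an unqualified PR.
def run_pipeline(pr, stages):
    for name, check in stages:
        if not check(pr):
            return f"blocked at {name}"  # fail fast, never reach later stages
    return "merged"

stages = [
    ("policy-gate", lambda pr: pr["risk_assessed"]),
    ("review-agent", lambda pr: pr["review_passed"]),
    ("ci-fanout", lambda pr: pr["ci_green"]),
]

pr = {"risk_assessed": True, "review_passed": False, "ci_green": True}
print(run_pipeline(pr, stages))  # blocked at review-agent; CI never runs
```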
Never Trust a Pass from a Stale Commit
This is what Carson emphasized most. If a pass from an old commit lingers, the latest code merges without verification. Re-run reviews on every push, and block the gate if the results don’t match the current head.
- A Review Check Run is valid only when it matches the current headSha
- Force a rerun on every synchronize event
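The staleness check itself is small. Here is a minimal sketch with assumed field shapes (not Greptile's actual API): a review pass counts only if it was produced for the branch's current head.

```python
# A check run from an older commit must never unlock the merge gate.
def review_is_valid(check_run, head_sha):
    return (
        check_run["conclusion"] == "success"
        and check_run["head_sha"] == head_sha  # must match the current head
    )

stale_pass = {"conclusion": "success", "head_sha": "abc123"}
print(review_is_valid(stale_pass, head_sha="def456"))  # False: new push, rerun required
print(review_is_valid(stale_pass, head_sha="abc123"))  # True: pass matches the head
```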
Issue Rerun Requests from Exactly One Source
When multiple workflows request reruns, you get duplicate comments and race conditions. It looks like a minor edge case, but left unsolved it destabilizes the entire pipeline.
- Prevent duplicates with a marker + sha:headSha pattern
- Skip the request if the SHA was already submitted
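The dedupe pattern can be sketched as follows. The marker format here is illustrative, not the exact string the workflows use:

```python
# Before posting a rerun request, scan existing comments for a hidden
# marker carrying the same head SHA; if found, the request was already made.
MARKER = "<!-- rerun-request sha:{sha} -->"  # hypothetical marker format

def should_request_rerun(existing_comments, head_sha):
    marker = MARKER.format(sha=head_sha)
    return not any(marker in comment for comment in existing_comments)

comments = ["LGTM", "<!-- rerun-request sha:abc123 --> rerun requested"]
print(should_request_rerun(comments, "abc123"))  # False: already requested for this SHA
print(should_request_rerun(comments, "def456"))  # True: new head, new request
```

Because the marker embeds the SHA, a new push naturally re-enables the request while the old one stays deduplicated.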
Let Agents Handle the Fixes Too
When the Review Agent finds a problem, the Coding Agent patches it and pushes to the same branch. Carson’s sharpest practical note here: pin the model version. Without it, the same prompt produces different results across runs and reproducibility disappears.
- Codex Action fixes → push → rerun trigger
- Pinned model versions ensure reproducibility
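A guard for the pinning rule might look like this. The config keys and model name are illustrative, not the actual Codex Action inputs:

```python
# Reject floating model aliases so the same prompt replays against the same weights.
PINNED = {
    "model": "example-model-2025-06-01",  # exact dated snapshot, never "latest"
    "temperature": 0,                     # deterministic sampling where supported
}

def assert_pinned(cfg):
    model = cfg["model"]
    # A pinned name ends in a date stamp, not a floating alias.
    if model.endswith(("latest", "preview")) or not model[-10:].replace("-", "").isdigit():
        raise ValueError(f"model {model!r} is not pinned to a dated version")
    return cfg

print(assert_pinned(PINNED)["model"])
```

Running this check inside the policy gate turns "someone forgot to pin the model" from a silent reproducibility bug into a loud, early failure.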
Only Auto-Close Bot-to-Bot Conversations
Never touch threads where a human participated. Without this distinction, reviewer comments get buried under automated noise.
- Auto-resolve only after a clean current-head rerun
- Threads with human comments stay open, always
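Both conditions combine into one predicate. This is a sketch with an assumed thread shape, not the actual workflow code:

```python
# A review thread is auto-resolvable only when every participant is a bot
# AND the most recent rerun on the branch's current head came back clean.
def can_auto_resolve(thread, head_sha):
    only_bots = all(c["author_is_bot"] for c in thread["comments"])
    clean_rerun = thread["last_rerun_sha"] == head_sha and thread["last_rerun_passed"]
    return only_bots and clean_rerun

bot_thread = {
    "comments": [{"author_is_bot": True}, {"author_is_bot": True}],
    "last_rerun_sha": "abc123",
    "last_rerun_passed": True,
}
human_thread = {**bot_thread, "comments": [{"author_is_bot": True}, {"author_is_bot": False}]}

print(can_auto_resolve(bot_thread, "abc123"))    # True: bots only, clean current-head rerun
print(can_auto_resolve(human_thread, "abc123"))  # False: a human spoke, keep it open
```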
Leave Visible, Verifiable Evidence
If the UI changed, a screenshot is not enough. Require CI-verifiable evidence. Turn production incidents into test cases so the same failure doesn’t repeat silently.
- Regression → harness gap issue → add test case → SLA tracking
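The incident-to-test-case loop can be sketched as a small issue-filing helper. All field names here are illustrative:

```python
import datetime

# Every production regression becomes a "harness gap" issue that must
# produce a CI-verifiable test case within an SLA window.
def file_harness_gap(incident, sla_days=14):
    found = datetime.date.fromisoformat(incident["date"])
    return {
        "title": f"Harness gap: {incident['summary']}",
        "labels": ["harness-gap", "regression"],
        "required_artifact": "CI-verifiable test case reproducing the failure",
        "due": (found + datetime.timedelta(days=sla_days)).isoformat(),
    }

issue = file_harness_gap({"summary": "checkout button unclickable", "date": "2025-01-10"})
print(issue["due"])  # 2025-01-24
```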
Carson’s Tool Choices
For reference, Carson selected Greptile as the code review agent and Codex Action for remediation. Three workflow files handle the heavy lifting: greptile-rerun.yml for canonical reruns, greptile-auto-resolve-threads.yml for stale thread cleanup, and risk-policy-gate.yml for preflight policy.
Visual Verification
Everything above catches whether code is right or wrong. But in practice, you also need to verify how the output looks.
Nico Bailon’s visual-explainer renders terminal diffs as HTML pages instead of ASCII, making change sets immediately readable at a glance.
Chris Tate’s agent-browser takes a different direction. It compares actual browser screens pixel by pixel to catch CSS and layout breakage. Combined with bisect, it can pinpoint exactly which commit caused the regression.
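The core idea behind pixel-level comparison is simple enough to sketch in pure Python (this is the concept, not agent-browser's implementation): flag a regression when the fraction of changed pixels crosses a threshold.

```python
# Compare two same-size frames pixel by pixel and report the changed fraction.
def pixel_diff_ratio(before, after):
    assert len(before) == len(after), "screenshots must be the same size"
    changed = sum(1 for a, b in zip(before, after) if a != b)
    return changed / len(before)

baseline = [(255, 255, 255)] * 100           # all-white frame
broken = baseline[:90] + [(255, 0, 0)] * 10  # a red strip appears

print(pixel_diff_ratio(baseline, broken))  # 0.1, well above a typical noise threshold
```

Combining a check like this with bisect is what lets the tool walk commit history until the ratio first jumps, landing on the exact commit that broke the layout.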
I’ve been thinking about this while building codexBridge. Session logs alone aren’t enough to track which agent wrote which code. You need a search structure that makes it easy to retrieve the right context later.
Who Verifies Agent-Written Code
The answer is not humans. It’s a structure where machines judge the evidence that machines produced.