7-Step Pipeline to Verify Code Written by AI Agents
When agents push 3,000 commits a day, humans can't review them all. Here's how to build a machine-verified pipeline that catches what people can't.
Peter, a developer at OpenClaw, sometimes pushes over 3,000 commits in a single day. No human review process scales to that volume. Ryan Carson’s “Code Factory” post lays out a workable answer: instead of reading everything, you build a structure where machines verify the code. The seven steps below come from that design, along with a few additions from the broader tooling ecosystem.
One honest caveat upfront: this pipeline catches a lot, but it doesn’t catch everything. Model-pinning reduces drift; it doesn’t eliminate it. Browser evidence prevents false positives on visual regressions; it still misses interaction bugs that only surface in production. The goal is a system that fails loudly and traceably, not one that claims to be infallible.
Define Merge Rules in a Single JSON File
Write down which paths are high-risk and which checks must pass, all in one file. The key insight is that this keeps documentation and scripts in sync. When the rules live in separate places, they drift.
- High-risk paths require a Review Agent plus browser-based evidence
- Low-risk paths can merge after passing a policy gate and CI alone
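As a concrete illustration, the single-file idea can be sketched like this. The file layout and check names below are assumptions for the example, not the actual schema from Carson's post:

```python
import fnmatch
import json

# Hypothetical merge-rules file: risk tiers and their required checks live together.
RULES_JSON = """
{
  "high_risk_paths": ["src/payments/**", "infra/**"],
  "checks": {
    "high": ["policy-gate", "review-agent", "browser-evidence", "ci"],
    "low":  ["policy-gate", "ci"]
  }
}
"""

def required_checks(changed_paths, rules):
    """Return the checks a PR must pass, based on the riskiest path it touches."""
    high = any(
        fnmatch.fnmatch(path, pattern)
        for path in changed_paths
        for pattern in rules["high_risk_paths"]
    )
    return rules["checks"]["high" if high else "low"]

rules = json.loads(RULES_JSON)
print(required_checks(["src/payments/refund.ts"], rules))
print(required_checks(["docs/intro.md"], rules))
```

Because both the documentation generator and the gate scripts read the same JSON, there is no second copy of the rules to drift out of date.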
Run Qualification Checks Before CI
Running builds on PRs that haven’t even passed review burns money. Putting a risk-policy gate in front of CI fanout cuts that waste significantly.
- Fixed order: policy gate → Review Agent confirmation → CI fanout
- Unqualified PRs never enter the test/build stage
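The fixed ordering can be modeled as a fail-fast chain, sketched below with invented stage names (the real pipeline runs these as separate GitHub Actions workflows):

```python
# Each stage runs only if the previous one passed, so CI fanout
# (the expensive part) never starts for an unqualified PR.
def run_pipeline(pr, stages):
    for name, check in stages:
        if not check(pr):
            return f"blocked at {name}"  # fail fast, never reach later stages
    return "merged"

stages = [
    ("policy-gate", lambda pr: pr["risk_assessed"]),
    ("review-agent", lambda pr: pr["review_passed"]),
    ("ci-fanout", lambda pr: pr["ci_green"]),
]

pr = {"risk_assessed": True, "review_passed": False, "ci_green": True}
print(run_pipeline(pr, stages))  # blocked at review-agent; CI never runs
```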
Never Trust a Pass from a Stale Commit
This is what Carson emphasized most. If a pass from an old commit lingers, the latest code merges without verification. Re-run reviews on every push, and block the gate if the results don’t match the current head.
- A Review Check Run is valid only when it matches the current headSha
- Force a rerun on every synchronize event
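The staleness check itself is small. Here is a minimal sketch with assumed field shapes (not Greptile's actual API): a review pass counts only if it was produced for the branch's current head.

```python
# A check run from an older commit must never unlock the merge gate.
def review_is_valid(check_run, head_sha):
    return (
        check_run["conclusion"] == "success"
        and check_run["head_sha"] == head_sha  # must match the current head
    )

stale_pass = {"conclusion": "success", "head_sha": "abc123"}
print(review_is_valid(stale_pass, head_sha="def456"))  # False: new push, rerun required
print(review_is_valid(stale_pass, head_sha="abc123"))  # True: pass matches the head
```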
Issue Rerun Requests from Exactly One Source
When multiple workflows request reruns, you get duplicate comments and race conditions. It looks like a minor edge case, but left unsolved it destabilizes the entire pipeline.
- Prevent duplicates with a marker + sha:headSha pattern
- Skip the request if the SHA was already submitted
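The dedupe pattern can be sketched as follows. The marker format here is illustrative, not the exact string the workflows use:

```python
# Before posting a rerun request, scan existing comments for a hidden
# marker carrying the same head SHA; if found, the request was already made.
MARKER = "<!-- rerun-request sha:{sha} -->"  # hypothetical marker format

def should_request_rerun(existing_comments, head_sha):
    marker = MARKER.format(sha=head_sha)
    return not any(marker in comment for comment in existing_comments)

comments = ["LGTM", "<!-- rerun-request sha:abc123 --> rerun requested"]
print(should_request_rerun(comments, "abc123"))  # False: already requested for this SHA
print(should_request_rerun(comments, "def456"))  # True: new head, new request
```

Because the marker embeds the SHA, a new push naturally re-enables the request while the old one stays deduplicated.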
Let Agents Handle the Fixes Too
When the Review Agent finds a problem, the Coding Agent patches it and pushes to the same branch. Carson’s sharpest practical note here: pin the model version. Without it, the same prompt produces different results across runs and reproducibility disappears.
- Codex Action fixes → push → rerun trigger
- Pinned model versions ensure reproducibility
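A guard for the pinning rule might look like this. The config keys and model name are illustrative, not the actual Codex Action inputs:

```python
# Reject floating model aliases so the same prompt replays against the same weights.
PINNED = {
    "model": "example-model-2025-06-01",  # exact dated snapshot, never "latest"
    "temperature": 0,                     # deterministic sampling where supported
}

def assert_pinned(cfg):
    model = cfg["model"]
    # A pinned name ends in a date stamp, not a floating alias.
    if model.endswith(("latest", "preview")) or not model[-10:].replace("-", "").isdigit():
        raise ValueError(f"model {model!r} is not pinned to a dated version")
    return cfg

print(assert_pinned(PINNED)["model"])
```

Running this check inside the policy gate turns "someone forgot to pin the model" from a silent reproducibility bug into a loud, early failure.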
Only Auto-Close Bot-to-Bot Conversations
Never touch threads where a human participated. Without this distinction, reviewer comments get buried under automated noise.
- Auto-resolve only after a clean current-head rerun
- Threads with human comments stay open, always
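Both conditions combine into one predicate. This is a sketch with an assumed thread shape, not the actual workflow code:

```python
# A review thread is auto-resolvable only when every participant is a bot
# AND the most recent rerun on the branch's current head came back clean.
def can_auto_resolve(thread, head_sha):
    only_bots = all(c["author_is_bot"] for c in thread["comments"])
    clean_rerun = thread["last_rerun_sha"] == head_sha and thread["last_rerun_passed"]
    return only_bots and clean_rerun

bot_thread = {
    "comments": [{"author_is_bot": True}, {"author_is_bot": True}],
    "last_rerun_sha": "abc123",
    "last_rerun_passed": True,
}
human_thread = {**bot_thread, "comments": [{"author_is_bot": True}, {"author_is_bot": False}]}

print(can_auto_resolve(bot_thread, "abc123"))    # True: bots only, clean current-head rerun
print(can_auto_resolve(human_thread, "abc123"))  # False: a human spoke, keep it open
```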
Leave Visible, Verifiable Evidence
If the UI changed, a screenshot is not enough. Require CI-verifiable evidence. Turn production incidents into test cases so the same failure doesn’t repeat silently.
- Regression → harness gap issue → add test case → SLA tracking
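The incident-to-test-case loop can be sketched as a small issue-filing helper. All field names here are illustrative:

```python
import datetime

# Every production regression becomes a "harness gap" issue that must
# produce a CI-verifiable test case within an SLA window.
def file_harness_gap(incident, sla_days=14):
    found = datetime.date.fromisoformat(incident["date"])
    return {
        "title": f"Harness gap: {incident['summary']}",
        "labels": ["harness-gap", "regression"],
        "required_artifact": "CI-verifiable test case reproducing the failure",
        "due": (found + datetime.timedelta(days=sla_days)).isoformat(),
    }

issue = file_harness_gap({"summary": "checkout button unclickable", "date": "2025-01-10"})
print(issue["due"])  # 2025-01-24
```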
Carson’s Tool Choices
For reference, Carson selected Greptile as the code review agent and Codex Action for remediation. Three workflow files handle the heavy lifting: greptile-rerun.yml for canonical reruns, greptile-auto-resolve-threads.yml for stale thread cleanup, and risk-policy-gate.yml for preflight policy.
Visual Verification
Everything above catches whether code is right or wrong. But in practice, you also need to verify how the output looks.
Nico Bailon’s visual-explainer renders terminal diffs as HTML pages instead of ASCII, making change sets immediately readable at a glance.
Chris Tate’s agent-browser takes a different direction. It compares actual browser screens pixel by pixel to catch CSS and layout breakage. Combined with bisect, it can pinpoint exactly which commit caused the regression.
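The core idea behind pixel-level comparison is simple enough to sketch in pure Python (this is the concept, not agent-browser's implementation): flag a regression when the fraction of changed pixels crosses a threshold.

```python
# Compare two same-size frames pixel by pixel and report the changed fraction.
def pixel_diff_ratio(before, after):
    assert len(before) == len(after), "screenshots must be the same size"
    changed = sum(1 for a, b in zip(before, after) if a != b)
    return changed / len(before)

baseline = [(255, 255, 255)] * 100           # all-white frame
broken = baseline[:90] + [(255, 0, 0)] * 10  # a red strip appears

print(pixel_diff_ratio(baseline, broken))  # 0.1, well above a typical noise threshold
```

Combining a check like this with bisect is what lets the tool walk commit history until the ratio first jumps, landing on the exact commit that broke the layout.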
I’ve been thinking about this while building codexBridge. Session logs alone aren’t enough to track which agent wrote which code. You need a search structure that makes it easy to retrieve the right context later.
Who Verifies Agent-Written Code
The answer is not humans. It’s a structure where machines judge the evidence that machines produced.