My Agent Called a Failing API 5 Times—The Bug Wasn't in the Code
When an agent repeats the same failing API call, code review won't help. Traces are the new source code for debugging AI agents.
An agent repeating the same failing API call five times looks like a retry bug. Check the retry logic: fine. Check the function flow: normal. Check the logs: nothing. The code is not where the bug lives.
Agent code is an empty vessel
Open any agent’s source code and you’ll find a model specification, a list of tools, and a system prompt. Which tool to call when, what reasoning sequence to follow — none of that lives in the code.
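To make that concrete, here is roughly all the “source code” a tool-calling agent needs: a minimal sketch against the OpenAI Python SDK, with an illustrative tool schema and a hypothetical execute_tool dispatcher. Notice what’s missing: there is no branch deciding which tool runs when. The loop just forwards whatever the model chooses.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat client has the same shape

client = OpenAI()

SYSTEM_PROMPT = "You are a support agent. Use the tools to resolve the user's issue."

TOOLS = [{  # illustrative tool schema, not a real product's
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Look up a customer's recent orders by email.",
        "parameters": {
            "type": "object",
            "properties": {"email": {"type": "string"}},
            "required": ["email"],
        },
    },
}]

def execute_tool(call) -> str:
    # Hypothetical dispatcher: a real agent maps call.function.name to actual code.
    return '{"orders": []}'

def run_agent(user_message: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]
    while True:
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=messages,
            tools=TOOLS,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content  # the model decided to answer directly
        messages.append(msg)
        for call in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": execute_tool(call),
            })
```

Everything that matters — the prompt, the tool list, the model — is data handed to the model, not logic you can review.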
Teams running LangGraph-based agents say the same thing repeatedly: “You can’t judge agent quality through code review.”
- Same code, same input, different tool call patterns every time
- Unlike a function like handleSubmit(), the branching logic simply doesn’t exist in the code
- Testing GPT-5.2 with the same query 10 times yields roughly 40% consistency in tool call ordering
- When errors occur, there’s no bug in the code, making reproduction impossible
In traditional software, the code is the behavior. In agents, the code is just the scaffolding. The actual behavior emerges at runtime, shaped by the model’s reasoning over whatever context it receives.
Traces are the new source code
A trace records every step the agent takes: what it reasoned at each point, which tool it called, and why. The debugging, testing, and performance analysis we used to do through code now has to happen through traces.
When an agent sees an error message and repeats the same call anyway, that’s a reasoning failure, not a code bug. You can only see it in the trace.
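Once traces exist, this failure mode can even be flagged mechanically. A minimal sketch, assuming each trace is exported as a flat list of tool-call events with a tool name, arguments, and an error field (the event shape here is invented for illustration):

```python
from collections import Counter

def find_futile_retries(trace: list[dict], threshold: int = 3) -> list[tuple]:
    """Flag tool calls repeated with identical arguments despite failing.

    Assumes events shaped like:
    {"tool": "search_orders", "args": {"email": "..."}, "error": "401 Unauthorized"}
    """
    streaks: Counter = Counter()
    flagged = []
    for event in trace:
        key = (event["tool"], repr(sorted(event["args"].items())))
        if event.get("error"):
            streaks[key] += 1
            if streaks[key] == threshold:  # flag once per futile loop
                flagged.append(key)
        else:
            streaks[key] = 0  # a success resets the streak
    return flagged
```

Each individual call in that loop looks like a normal, handled error in application logs. Only the sequence across the trace reveals that the agent never changed its approach.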
- Comparing traces before and after a prompt change reveals reasoning quality differences instantly
- In LangSmith, loading a trace from a specific point into the playground works like setting a breakpoint (capturing those traces takes almost no code; see the sketch after this list)
- A single trace can show you the exact moment the agent’s reasoning went off track, something no amount of logging can replicate
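A minimal instrumentation sketch using the langsmith SDK’s @traceable decorator (the two functions are illustrative; LangGraph apps emit traces automatically once the environment variables are set):

```python
import os
from langsmith import traceable

os.environ["LANGSMITH_TRACING"] = "true"  # older SDKs use LANGCHAIN_TRACING_V2
os.environ["LANGSMITH_API_KEY"] = "..."   # your key

@traceable(run_type="tool")
def search_orders(email: str) -> list[dict]:
    # Every call becomes a span: inputs, outputs, latency, and errors are recorded.
    return [{"order_id": "A-1001", "status": "refund_pending"}]

@traceable(run_type="chain")
def handle_ticket(message: str) -> str:
    # Nesting is captured too: search_orders appears as a child of this run,
    # which is exactly the structure you later load into the playground.
    orders = search_orders("user@example.com")
    return f"Found {len(orders)} order(s) related to: {message}"
```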
Traditional debugging is reading a recipe to find the mistake. Agent debugging is watching the kitchen footage to see where the chef went wrong. The recipe might be perfect. The execution is where things break.
Testing fundamentally changes
In traditional software, you test before deployment and you’re done. Agents are non-deterministic, so you have to keep evaluating in production. And even with good traces, building reliable eval datasets takes time — the tooling is still immature, and coverage will always lag behind the variety of real user behavior.
Without a pipeline that collects traces, builds eval datasets, and catches quality degradation or drift, you cannot operate agents at scale.
- Build an automated eval pipeline that samples production traces weekly (sketched after this list)
- Pre-deployment testing alone cannot guarantee quality for non-deterministic systems
- Monitoring without traces is like only checking whether the server is running
- An agent can be “working normally” while executing completely wrong tasks — only traces catch this
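Here is what the weekly sampling step of such a pipeline might look like, sketched against the langsmith Python client (the project name, sampling rate, and the choice to store current outputs as reference baselines are all assumptions):

```python
import random
from datetime import datetime, timedelta, timezone
from langsmith import Client

client = Client()

def sample_week_into_dataset(project: str, dataset_name: str, rate: float = 0.05) -> None:
    """Sample ~5% of last week's production runs into an eval dataset."""
    since = datetime.now(timezone.utc) - timedelta(days=7)
    runs = client.list_runs(project_name=project, start_time=since, is_root=True)
    dataset = client.create_dataset(dataset_name=dataset_name)
    for run in runs:
        if random.random() > rate:
            continue
        client.create_example(
            inputs=run.inputs,
            outputs=run.outputs,  # today's behavior becomes the baseline to diff against
            dataset_id=dataset.id,
        )

# e.g. from a weekly cron job:
# sample_week_into_dataset("support-agent-prod", "support-agent-evals-w42")
```

Pair this with a grader that scores new runs against the dataset and alerts on regressions, and you have the drift detection the pipeline needs.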
Collaboration and product analytics happen on traces too
Code review happens on GitHub. Agent judgment review happens on observability platforms. Teams are commenting on traces, sharing specific decision points, and reviewing agent reasoning the way they used to review pull requests.
Product analytics follows the same pattern. When a metric says “30% of users are dissatisfied,” you can’t find the cause without opening traces. The agent might be completing tasks successfully by its own measure while completely missing what the user actually wanted.
- Product analytics tools like Mixpanel and agent observability tools are converging on traces as the shared substrate
- Analyzing agent tool call patterns can reverse-engineer what features users actually need
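That last point is mechanical once traces are flowing. A sketch of the simplest version, frequency-counting tool-call n-grams across exported traces (using the same invented event shape as the earlier sketch):

```python
from collections import Counter

def tool_ngrams(traces: list[list[dict]], n: int = 2) -> Counter:
    """Count the most common n-grams of tool calls across many traces."""
    counts: Counter = Counter()
    for trace in traces:
        tools = [event["tool"] for event in trace if "tool" in event]
        for i in range(len(tools) - n + 1):
            counts[tuple(tools[i : i + n])] += 1
    return counts

# A spike in ("search_orders", "search_orders") pairs says users rarely find
# what they want on the first try: a product gap, not a bug.
```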
Code is the blueprint; traces are the footage
When something goes wrong in a building, you don’t unfold the blueprint first. You rewind the security camera footage.
The teams getting agent quality right are the ones that shifted their center of gravity from code to traces. Not because code doesn’t matter, but because the interesting failures — the ones that cost you users and money — live in the runtime behavior that only traces capture.