9 min read · Updated Mar 12, 2026

How Codex Solves the Compaction Problem Differently

I reverse-engineered how Codex handles context overflow compared to Claude Code. The answer involves AES encryption, session handover patterns, and KV cache tricks.

If you’ve used Claude Code for any serious coding session, you’ve seen it: “Compacting conversation…” appears in your terminal, and from that point forward, something feels off. The model starts forgetting things you discussed ten minutes ago. Response latency climbs. You ask about a function you just refactored together, and it responds as if hearing about it for the first time.

This happens because Claude Code’s 200K token context window fills up faster than most people expect. One large refactoring session, a few file reads, some tool calls with verbose output, and you’re at capacity. When that threshold hits (roughly 75-92% of the window, though I’ve seen it trigger as early as 65%), Claude Code summarizes the conversation, drops the original messages, and continues with just the summary. The information that didn’t make it into the summary is gone.
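The bookkeeping behind that trigger can be sketched in a few lines. The 4-characters-per-token estimate and the 80% trigger fraction below are illustrative assumptions, not Claude Code's actual internals:

```python
# Rough sketch of when a Claude Code-style compaction would fire.
# The 200K window is as described above; the trigger fraction varies in
# practice (observed anywhere from ~65% to ~92%).

CONTEXT_WINDOW = 200_000
TRIGGER_FRACTION = 0.80  # illustrative midpoint, not an official value

def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly 4 characters per token for English text."""
    return len(text) // 4

def should_compact(messages: list[str], trigger: float = TRIGGER_FRACTION) -> bool:
    used = sum(estimate_tokens(m) for m in messages)
    return used >= CONTEXT_WINDOW * trigger

# A single large file read can consume a surprising share of the window:
big_file = "x" * 1_000_000  # ~250K estimated tokens on its own
print(should_compact([big_file]))  # True: one read alone crosses the threshold
```

The point of the sketch is the ratio, not the constants: one verbose tool result can move you a double-digit percentage toward the trigger.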

I kept hearing that OpenAI’s Codex handles this differently, so I spent time pulling apart every public analysis I could find. The most interesting work came from Kangwook Lee, CAIO at Krafton, who used prompt injection to reverse-engineer the actual pipeline.

What compaction loses and why it matters

The core problem is straightforward. Summarization is lossy compression. When Claude Code compacts, it runs a background summarization of the full conversation, creates a compaction block, and discards everything before it. CLAUDE.md files survive because they’re re-read from disk, but anything you said only in conversation disappears unless the summarizer captured it.

Tool call results are where this hurts most. When you ask Claude Code to read a file, the full file content enters the context. When you run a command, the full output enters the context. These tool results are often the most information-dense parts of the conversation, and they’re exactly what gets flattened during summarization. A 500-line file read becomes a single sentence like “read the configuration file and noted the database settings.” The specific values, the edge cases you discussed, the line numbers you referenced are all gone.

I’ve watched this happen dozens of times. After compaction, I ask “what was the return type of that helper function we looked at?” and get a confidently wrong answer. The model isn’t hallucinating in the usual sense. It’s working from a summary that genuinely doesn’t contain what I’m asking about.

After 9 or more compactions in a long session, the problem compounds. Each summary compresses the previous summary further. Decision rationale from early in the session erodes completely. By hour 10 of a session, the model has no memory of why you chose approach A over approach B, even if you spent twenty minutes discussing the trade-offs.

Inside Codex’s encrypted compaction pipeline

Kangwook Lee’s analysis was clever. He used two chained prompt injections to extract the internal behavior of Codex’s compaction system.

The first injection targeted the compactor LLM itself. When Codex triggers compaction, it doesn’t just summarize locally. It sends the conversation to a separate LLM on OpenAI’s servers, which produces a summary. Lee’s injection tricked this compactor into including its own system prompt in the summary output. The server then AES-encrypted this summary (now containing the leaked prompt) and returned it as an opaque blob.

The second injection exploited the decryption step. By passing the encrypted blob plus a crafted user message back to the Responses API, the server decrypted the blob and assembled the model’s context. Since the first injection had embedded the compactor’s system prompt inside the summary, the decrypted context revealed how the entire pipeline works.

Here’s what he found: when you call Codex’s compact() API, a separate LLM summarizes the conversation, and the result comes back AES-encrypted. On the next turn, the server decrypts this blob, prepends a handoff prompt (“here’s a summary of the previous conversation”), and feeds the whole thing to the model. The encryption key lives on OpenAI’s servers. The client never sees the plaintext summary.
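Based on Lee's description, the client-side flow can be modeled like this. Everything here is a toy: the class names and methods are mine, and an opaque server-held handle stands in for the real AES ciphertext. The property being modeled is the important one: the summary plaintext never reaches the client.

```python
import secrets

# Toy model of the pipeline Lee describes: the client only ever holds an
# opaque blob; the plaintext summary lives server-side. Real Codex returns
# AES ciphertext; a random handle stands in for it here.

class Server:
    def __init__(self):
        self._store = {}  # blob -> plaintext summary (never leaves the server)

    def compact(self, conversation: list[str]) -> str:
        # Stand-in for the separate compactor LLM.
        summary = "SUMMARY: " + " | ".join(conversation)
        blob = secrets.token_hex(16)  # opaque to the client
        self._store[blob] = summary
        return blob

    def next_turn(self, blob: str, user_msg: str) -> str:
        summary = self._store[blob]  # server-side "decryption"
        handoff = "Here is a summary of the previous conversation:\n"
        return handoff + summary + "\nUser: " + user_msg  # context fed to the model

server = Server()
blob = server.compact(["refactored auth.py", "ran tests, 2 failures"])
context = server.next_turn(blob, "what failed?")
print("SUMMARY" in blob)  # False: the client-held blob reveals nothing
```

The client's only job is to pass the blob back unmodified, which is exactly why the second injection had to go through the server's own decryption step.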

The compaction prompt itself turned out to be nearly identical to the open-source Codex CLI’s compaction template for non-Codex models. No secret sauce in the prompt engineering. The interesting part is the architecture: server-side encryption of summaries, server-side decryption and injection, and an opaque blob that the client passes around without being able to inspect or modify it.

Why encrypt at all? Lee’s analysis didn’t definitively answer this. One theory is that the encrypted blob contains more than just a text summary: tool call restoration data, internal state markers, or structured metadata that OpenAI doesn’t want exposed. Another possibility is simply that encrypted blobs prevent users from tampering with the summary to manipulate the model’s behavior. I find the second explanation more likely, but neither is confirmed.

OpenAI also supports this server-side through the Responses API. Set a compact_threshold value, and when the token count crosses it, the server runs compaction inline. The compaction item streams back within the response, and you append it to subsequent requests. Items before the most recent compaction item can be safely dropped.
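A sketch of what that client-side loop looks like. The `compact_threshold` parameter name and the `compaction` item type follow the description above, but treat the exact payload shape as an assumption and check OpenAI's current Responses API documentation before relying on it:

```python
# Sketch of server-side compaction via a token threshold, plus the pruning
# rule described above. Payload shape and item types are assumptions based
# on public descriptions, not a verified API contract.

def build_request(model: str, items: list[dict], compact_threshold: int) -> dict:
    return {
        "model": model,
        "input": items,
        "compact_threshold": compact_threshold,  # server compacts past this token count
    }

def prune(items: list[dict]) -> list[dict]:
    """Drop everything before the most recent compaction item."""
    last = max((i for i, it in enumerate(items) if it.get("type") == "compaction"),
               default=None)
    return items if last is None else items[last:]

history = [
    {"type": "message", "role": "user", "content": "read config.py"},
    {"type": "compaction", "encrypted_content": "(opaque blob)"},
    {"type": "message", "role": "user", "content": "what port does it use?"},
]
print(len(prune(history)))  # 2: the pre-compaction message is dropped
```

The pruning step is what keeps request size bounded: once a compaction item exists, everything before it is dead weight the server no longer needs.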

Contrast this with Claude Code’s approach: the compaction block is human-readable. You can inspect it, and you can customize the compaction behavior through the instructions parameter or by adding custom compaction directives to CLAUDE.md. More transparent, but the same fundamental information loss applies.

The session handover pattern

Compaction mechanics aside, the more interesting problem is what happens when you need to start a new session without losing context. A developer's automation I came across here changed how I think about the problem.

The pattern works like this. Right before compaction triggers, a pre-compact hook blocks all write tools. This prevents the model from making code changes while it’s in a partially-aware state, which is a failure mode I’ve hit multiple times: compaction fires mid-refactor, the model loses track of which files it already changed, and writes conflicting edits.
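One way to implement the write freeze is a hook script that vetoes write tools whenever a lock file exists (the lock itself would be dropped by a pre-compact hook, which I've omitted). Claude Code's hook convention of blocking a tool call via a non-zero exit code is documented behavior, but verify the details against current docs; the tool names and lock path here are assumptions:

```python
# Sketch of a pre-tool-use hook that refuses write tools while a
# compaction lock is present. Registering this script, and the companion
# pre-compact hook that creates LOCK, are left out. Tool names and the
# blocking convention should be checked against Claude Code's hook docs.
import sys
import tempfile
from pathlib import Path

LOCK = Path(tempfile.gettempdir()) / "compaction.lock"  # created by the pre-compact hook
WRITE_TOOLS = {"Write", "Edit", "MultiEdit", "NotebookEdit"}

def decide(event: dict) -> int:
    if LOCK.exists() and event.get("tool_name") in WRITE_TOOLS:
        print("Writes are blocked until the compaction handover finishes.",
              file=sys.stderr)
        return 2  # non-zero exit: block the tool call
    return 0      # allow the tool call

# Entry point when invoked as a hook (reads the event JSON on stdin):
#   import json; sys.exit(decide(json.load(sys.stdin)))
```

Reads stay allowed on purpose: the model can still inspect state while the handover is being prepared, it just can't mutate anything.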

With writes blocked, the system extracts only user messages and thinking blocks from the JSONL session log. Everything else (tool calls, file contents, assistant responses) gets dropped. This cuts the log to about 2% of its original size.
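The extraction step is simple enough to sketch directly. The record shapes below are assumptions about the session log format; adjust the field names to whatever your logs actually contain:

```python
import json
from pathlib import Path

# Sketch of the extraction step: keep only user messages and thinking
# blocks from a session JSONL log. Record shapes are assumed, not taken
# from a documented log schema.

def extract(jsonl_path: Path) -> list[dict]:
    kept = []
    for line in jsonl_path.read_text().splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if rec.get("type") == "user":
            kept.append({"role": "user", "text": rec.get("text", "")})
        elif rec.get("type") == "thinking":
            kept.append({"role": "thinking", "text": rec.get("text", "")})
        # tool calls, file contents, and assistant replies are dropped
    return kept
```

Dropping tool results is what produces the ~2% figure: in a long session they dominate the log by volume, and they are also the part most safely re-derivable from disk.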

Then three sub-agents run in parallel, each searching the original uncompressed JSONL logs for information that the extraction missed. They’re looking for gaps: architectural decisions that were discussed but not captured in user messages, error patterns that only appeared in tool output, rationale for approaches that were rejected. These agents compile their findings into a resume-prompt.md file that contains the session summary, the gap analysis results, and a list of modified files.

A VS Code file watcher detects the new resume-prompt.md and opens a fresh session that loads it as initial context. The new session starts with a clear, complete picture of where the previous session left off.

The reported improvement was 10x in build efficiency. That number is hard to verify independently, but the architecture makes sense. Instead of one increasingly degraded summary, you get a fresh context window with a curated, gap-checked handover document.

I tried implementing a simpler version of this myself. The gap analysis step is where the value concentrates. Without it, you’re just doing what compaction already does but in a different format. With it, you’re actively recovering information that summarization lost. My version uses a single sub-agent instead of three, and the results are noticeably better than raw compaction but probably not as thorough as the full three-agent approach.

KV cache as the hidden cost lever

There’s a performance dimension to this that most discussions miss entirely. KV cache (the key-value pairs computed during attention) can be reused across requests when the prompt prefix is identical. Two requests sharing the same opening tokens skip recomputation for those tokens.

The numbers are significant. In a controlled test comparing stable vs. perturbed system prompts, stable prefixes achieved 85% cache hit rates with median time-to-first-token of 953ms. Perturbed prefixes: 0% cache hits, 2,727ms TTFT. Cost per request dropped from $0.033 to $0.009. That’s a 65% latency reduction and roughly a 73% cost reduction just from keeping the prompt prefix consistent.

This has direct implications for the session handover pattern. If your resume-prompt.md always starts with the same structural prefix (system prompt, handoff instructions, then variable content), the fixed portion gets cached. Every subsequent request in the new session benefits from that cache. If you randomize the prefix structure or inject variable content early, every request recomputes from scratch.
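The discipline this implies is easy to state in code: every byte that changes between requests goes after every byte that doesn't. The prompt strings below are illustrative, not a real handover template:

```python
import os

# Sketch of a cache-friendly prompt layout: the fixed prefix is
# byte-identical across requests, so the KV cache covers it; only the
# tail varies. Strings are illustrative placeholders.

SYSTEM_PROMPT = "You are resuming a prior coding session.\n"   # fixed
HANDOFF_HEADER = "## Session handover\nRead the summary below before acting.\n"

def build_prompt(resume_doc: str, user_msg: str) -> str:
    # Fixed prefix first, variable content last.
    return SYSTEM_PROMPT + HANDOFF_HEADER + resume_doc + "\nUser: " + user_msg

a = build_prompt("summary of prior session", "question 1")
b = build_prompt("summary of prior session", "question 2")

# Within one session the shared prefix extends through the whole resume doc:
prefix_len = len(os.path.commonprefix([a, b]))
print(prefix_len >= len(SYSTEM_PROMPT) + len(HANDOFF_HEADER))  # True
```

The anti-pattern is equally simple: a timestamp, request ID, or randomized header at the top of the prompt makes the common prefix zero bytes long, and every request recomputes from scratch.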

I designed my session folder structure around this insight. Session-id-based archiving keeps handover documents organized, and the fixed-prefix convention for resume prompts means the first 40-50K tokens of every new session hit the KV cache. Pre-indexing session archives with QMD (a tool I covered separately) makes the retrieval step faster when sub-agents need to search historical sessions.

What actually matters here

The real takeaway isn’t that Codex’s approach is better or worse than Claude Code’s. Both lose information during compaction. Both struggle with long sessions. The architectural difference (encrypted opaque blob vs. human-readable compaction block) reflects different design philosophies, but the fundamental limitation is the same: context windows are finite, and summarization is lossy.

What matters is what you build around that limitation. The session handover pattern, gap analysis, JSONL-based retrieval, KV cache optimization: these are engineering solutions to a problem that no amount of model improvement will fully solve. A 500K or 1M token context window delays the problem but doesn’t eliminate it.

The bottleneck in AI coding tools isn’t model intelligence. It’s context management. I’ve seen this firsthand: a mediocre summary with good retrieval outperforms an excellent summary with no retrieval every time. Building systems that retrieve forgotten information reliably matters more than building systems that summarize more accurately.

Technical details sourced from Kangwook Lee’s analysis and public API documentation from both OpenAI and Anthropic.
