Four Contexts That Decide Whether AI Helps or Wastes Your Time
I spent a weekend stuffing 100MB of PDFs into an agent. Performance got worse. Mapping what I was feeding into four categories finally showed me why.
Large language models, prompt engineering, and benchmarking.
11 posts
Someone benchmarked an LLM-written Rust reimplementation of SQLite. The gap between code that looks right and code that is right turned out to be five orders of magnitude.
I reverse-engineered how Codex handles context overflow compared to Claude Code. The answer involves AES encryption, session handover patterns, and KV cache tricks.
New benchmark data shows AGENTS.md and CLAUDE.md context files actually hurt coding agent performance. Sometimes laziness is the best engineering decision.
Google Research validated it across 7 models and 7 benchmarks. No training, no prompt engineering. Just copy-paste. I tested it and here's what actually happened.
The same model flipped leaderboard rankings in LangChain's Terminal Bench results and the hashline format experiment, and the reasons came down to three things: prompts, tools, and middleware.
OpenAI's $10B Cerebras deal, Nvidia acquiring Groq, and Google TPU mega-contracts signal a tectonic shift from GPU-centric training to inference-first silicon.
While the market warns of GPU overcapacity, OpenAI declares it needs even more compute. The real winner won't be whoever has the most power; it'll be whoever closes the gap between AI capability and actual user experience.
Anthropic's Claude Opus 4.5 didn't just set new benchmark records. It proved that going all-in on text, code, and agents while competitors spread themselves thin is the winning play.
Poetiq's recursive meta-system became the first to surpass 50% on ARC-AGI-2, the benchmark designed to test true general intelligence. Here's how a 6-person team outperformed Google at half the cost.
Bigger context windows don't make AI smarter. RLM flips the script by letting LLMs write code to selectively read massive documents instead of ingesting them whole.