3 min read · Updated Feb 18, 2026

AI Approaches Human Reasoning for the First Time: Poetiq Breaks 50% on ARC-AGI-2

Poetiq's recursive meta-system became the first to surpass 50% on ARC-AGI-2, the benchmark designed to test true general intelligence. Here's how a 6-person team outperformed Google at half the cost.

ARC-AGI is the benchmark designed to evaluate whether AI possesses genuine general intelligence. Rather than asking models to regurgitate training data, it presents completely novel pattern problems and requires the system to infer the underlying rules on its own. Humans average around 60% accuracy. Leading AI models scored under 5% on ARC-AGI-2 in early 2025.

Poetiq, a six-person team with 53 years of combined experience from Google DeepMind, has now been officially verified by the ARC Prize Foundation at 54% accuracy on ARC-AGI-2. They are the first to cross 50%. The cost per problem is $30.57, compared to Gemini 3 Deep Think’s $77.16 for a lower score. Their approach and prompts are fully open-sourced on GitHub.

Recursive Reasoning Over Raw Scale

The core architecture is a meta-system that does not train new models. Instead, it orchestrates existing LLMs through iterative loops of reasoning.

The system generates a candidate solution, critiques it, analyzes the feedback, and uses the LLM to refine the answer, then repeats. The prompt is the interface; the reasoning process is the product. This is a deliberate departure from standard chain-of-thought prompting, which asks once and accepts the output. Poetiq’s system treats each answer as a draft to be improved through structured self-critique.
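The loop described above can be sketched in a few lines. This is an illustrative outline, not Poetiq's actual code: the `llm` function is a hypothetical stand-in for any chat-completion call, and the prompts and the `"OK"` stopping convention are assumptions made for the sketch.

```python
def llm(prompt: str) -> str:
    """Stub LLM so the sketch runs offline; a real system would call an API here."""
    if "Critique" in prompt:
        return "OK"  # the critic finds no flaws
    return "candidate solution"

def refine_loop(task: str, max_rounds: int = 4) -> str:
    """Generate a draft, then critique and revise it until the critic approves."""
    answer = llm(f"Solve this task:\n{task}")
    for _ in range(max_rounds):
        critique = llm(f"Critique this answer.\nTask: {task}\nAnswer: {answer}")
        if critique.strip() == "OK":  # structured self-critique found nothing to fix
            break
        answer = llm(f"Revise the answer using this critique.\n"
                     f"Task: {task}\nAnswer: {answer}\nCritique: {critique}")
    return answer

print(refine_loop("infer the grid rule"))
```

The key design point is that the answer is never accepted on the first pass; every output re-enters the loop as a draft until the critique step signs off or the round budget runs out.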

The jump from sub-5% to 54% in under a year is striking. It is fair to ask whether ARC-AGI-2 actually measures what its designers claim: general intelligence, rather than a specific pattern-matching capability that recursive refinement happens to exploit well. Goodharting of benchmarks is real, and 54% is still below the human average of roughly 60%.

Self-Auditing: Knowing When to Stop

The self-auditing mechanism is where the architecture gets interesting. The system determines autonomously when it has gathered sufficient information and when to terminate the reasoning process.

This is not just an engineering convenience. By averaging fewer than two LLM requests per ARC problem, the system avoids the runaway compute costs that plague naive “keep trying” loops. The cost efficiency is a direct consequence of the stopping criterion, not a separate optimization. A system that cannot decide when to stop tends to either terminate too early or burn tokens indefinitely, and Poetiq appears to have found a workable middle ground, at least on this benchmark.
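One minimal way to implement such a stopping rule, offered here as an assumption rather than Poetiq's published criterion, is to stop when two consecutive drafts agree or when a hard request budget is hit. Agreement between drafts acts as a cheap proxy for "enough information gathered", and the budget caps worst-case cost.

```python
def solve_with_budget(drafts, max_requests: int = 2):
    """Consume successive candidate answers (one LLM request each),
    stopping on convergence or when the request budget is exhausted."""
    previous = None
    requests = 0
    for answer in drafts:
        requests += 1
        if answer == previous or requests >= max_requests:
            return answer, requests  # converged, or out of budget
        previous = answer
    return previous, requests

# Two consecutive drafts agree, so the loop stops after two requests
# and never spends the third.
answer, used = solve_with_budget(iter(["A", "A", "B"]))
```

Under this scheme the sub-two-requests-per-problem average falls out naturally: easy problems converge on the first or second draft, and only the budget cap, not an unbounded retry loop, governs the hard ones.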

What the Architecture Suggests

Following the Tiny Recursive Model (TRM) and RLM, Poetiq’s result adds evidence that recursive reasoning architectures are a viable path worth taking seriously. The lesson is not about bigger models or longer context windows. Designing systems that generate, evaluate, and refine in structured loops can outperform brute-force scale at a fraction of the cost.

How well this transfers to tasks outside ARC-AGI-2’s grid-pattern domain is the open question. The methodology is available on GitHub for anyone who wants to test that generalization directly.
