A game millions of people solve over morning coffee is exposing a fundamental weakness in the Transformer-based LLMs that dominate A.I. today. Here's why Sudoku matters for the future of A.I.:
THE BENCHMARK
Pathway tested its post-transformer architecture, BDH (Baby Dragon Hatchling 🐲), against "Sudoku Extreme," a collection of ~250,000 of the hardest Sudoku puzzles available.
Leading LLMs (such as o3-mini, DeepSeek-R1, Claude 3.7 Sonnet) scored effectively zero percent.
BDH, in stark contrast, solved them at 97.4% accuracy. That's not a marginal gap... it's a categorical one.
WHY SUDOKU IS A GREAT A.I. TEST
Sudoku is a constraint-satisfaction problem: Every move must satisfy multiple rules simultaneously across rows, columns and boxes. It demands search, tracking and backtracking — well beyond pattern-matching.
This makes it a clean proxy for real-world reasoning in medicine, law, operations, planning and tons of other fields, where you balance competing constraints under uncertainty.
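To make "search, tracking and backtracking" concrete, here's a minimal depth-first Sudoku solver sketch in Python. It's an illustration of the technique the puzzle demands, not Pathway's benchmark code:

```python
# Minimal backtracking Sudoku solver (illustrative sketch only).
# Grid is a 9x9 list of lists; 0 marks an empty cell.

def candidates(grid, r, c):
    """Digits that satisfy the row, column, and 3x3 box constraints."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def solve(grid):
    """Fill empty cells depth-first; undo and backtrack on dead ends."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d          # tentatively commit
                    if solve(grid):
                        return True
                    grid[r][c] = 0          # backtrack: branch failed
                return False                # no digit fits -> dead end
    return True                             # no empty cells -> solved
```

Every placement must satisfy three constraints at once, and a wrong guess is only discovered later, forcing the solver to undo work — exactly the behavior that's hard to express one token at a time.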
WHY TRANSFORMERS STRUGGLE
LLMs turn every problem into text and solve it by predicting the next token. That works brilliantly for language tasks... but Sudoku doesn't live in language.
A transformer's internal state is constrained to ~1,000 floating-point values per token, and each decision gets locked in as text is generated. It can't hold multiple candidate strategies in parallel or backtrack without verbalizing every step.
WHAT BDH DOES DIFFERENTLY
BDH maintains a much larger internal "latent reasoning space" that isn't forced into text (think of a chess grandmaster playing 20 blindfold games without whispering moves to herself).
It uses sparse positive activations (~5% of neurons firing at any time), far more biologically plausible than the dense activation in transformers.
It's a state-based model (no standard attention mechanism) that continuously updates its internal state using a rule inspired by biological neuroscience, Hebbian learning: "neurons that fire together wire together."
It achieves continual learning: BDH can pick up a new game's rules and reach advanced-beginner level in ~20 minutes, then improve through play... at roughly 10x lower cost than the Transformer-based LLMs incur while earning their near-zero scores.
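A back-of-the-envelope NumPy sketch of two of those ingredients — sparse positive activations and a Hebbian state update. This illustrates the ideas only; BDH's actual update rule is defined in Pathway's paper, and the names and numbers below are my own:

```python
import numpy as np

# Illustrative sketch: ~5% sparse positive activations plus a Hebbian
# "fire together, wire together" update. Not BDH's actual equations.

rng = np.random.default_rng(0)
N, SPARSITY = 1000, 0.05                      # ~5% of neurons active

def sparse_activate(x):
    """Keep only the top ~5% of positive pre-activations; zero the rest."""
    k = int(N * SPARSITY)
    out = np.maximum(x, 0.0)                  # positive activations only
    thresh = np.partition(out, -k)[-k]        # value of the k-th largest
    return np.where(out >= thresh, out, 0.0)

def hebbian_update(W, a, lr=0.01):
    """Strengthen connections between co-active neurons: dW ~ a_i * a_j."""
    return W + lr * np.outer(a, a)

W = np.zeros((N, N))                          # synaptic state
a = sparse_activate(rng.standard_normal(N))   # one step of activity
W = hebbian_update(W, a)

print(f"active neurons: {np.count_nonzero(a)} / {N}")
```

Because only ~50 of 1,000 neurons fire, each Hebbian step touches only the small block of synapses between co-active neurons — a hint at why sparsity keeps this kind of state update cheap.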
CAVEAT
BDH is still early: It has been demonstrated at a ~1 billion parameter scale (comparable to GPT-2), not yet at frontier scale.
BOTTOM LINE
The architecture is early, but the data are clear: 0% vs. 97.4% is not incremental. It suggests the transformer's reasoning ceiling is real and that alternative architectures can address it. Exciting to see alternatives to the dominant but limiting Transformer architecture emerge!
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.