Apple’s latest research reveals a critical limitation in current reasoning-enabled language models: performance degrades sharply as task complexity increases. In structured tests built from classic logic puzzles, these models initially improve as they spend more “thinking” steps, but collapse entirely on harder tasks, often generating fewer reasoning steps precisely when more are needed. This counterintuitive “underthinking” highlights a fundamental flaw: current architectures cannot scale reasoning effectively. The findings challenge the idea that simply adding chain-of-thought prompts or compute leads to better thinking, signaling the need for fundamentally new approaches to build truly general reasoning systems.
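The experimental idea is straightforward to mimic: take a puzzle whose difficulty has a clean dial (Tower of Hanoi, one of the puzzles the paper uses, scales with the number of disks), sweep that dial upward, and record both accuracy and how many reasoning tokens the model spends. Below is a minimal Python sketch of such a sweep under stated assumptions: `query_model` is a hypothetical placeholder (Apple’s harness is not public), and the correctness check here only compares the claimed move count against the known optimum, whereas the paper validates every individual move.

```python
import re


def hanoi_optimal_moves(n: int) -> int:
    """Minimum move count for Tower of Hanoi with n disks: 2**n - 1."""
    return 2 ** n - 1


def query_model(prompt: str) -> tuple[str, int]:
    """Hypothetical stand-in for a reasoning-model API call.

    Returns (answer_text, reasoning_token_count). Swap in a real client;
    this stub exists only so the sketch runs end to end.
    """
    return "", 0


def complexity_sweep(max_disks: int = 12) -> None:
    """Scale puzzle difficulty and log accuracy vs. reasoning effort.

    The pattern the paper reports: reasoning-token counts first rise
    with n, then shrink past a collapse point, even though the optimal
    solution grows exponentially (2**n - 1 moves).
    """
    for n in range(1, max_disks + 1):
        prompt = (
            f"Solve Tower of Hanoi with {n} disks. List every move, "
            "then state the total number of moves on the last line."
        )
        answer, reasoning_tokens = query_model(prompt)
        # Crude check: read the last number as the claimed move count.
        numbers = re.findall(r"\d+", answer)
        claimed = int(numbers[-1]) if numbers else -1
        correct = claimed == hanoi_optimal_moves(n)
        print(f"n={n:2d}  optimal={hanoi_optimal_moves(n):5d}  "
              f"reasoning_tokens={reasoning_tokens:6d}  correct={correct}")


if __name__ == "__main__":
    complexity_sweep()
```

Plotted over n, the reported failure mode would show up as reasoning_tokens rising, peaking, and then falling off exactly where correct flips to False across the board.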

The Decoder