The rise of Artificial Intelligence brought on many conversations about how computers could now think for themselves and answer questions accurately thanks to so-called large reasoning models (LRMs). These models are used by industry giants like OpenAI, DeepSeek, Anthropic (Claude), and Google (Gemini), but perhaps the most widely used program of all is OpenAI’s ChatGPT. This and other programs are supposed to mimic human-like reasoning and use their LRMs to avoid pitfalls like making up information or misreading it when tackling complex tasks. But Apple has conducted an investigation of its own, and it turns out that this technology may not be as advanced as we have been led to believe.
First and foremost, Apple has its own version of Artificial Intelligence, so it is not exactly an unbiased party here. That said, its software does not claim to do most of the things that, for example, OpenAI’s does, so there is some separation. Still, even if the findings are likely valid, it is worth taking the investigation with a grain of salt.
Apple’s investigation into Artificial Intelligence and LRMs
The study was conducted by Apple’s AI research team, and its findings are not encouraging for those who are sure that Artificial Intelligence is taking over the world. After testing several of the top reasoning models out there, the first thing the researchers noticed was that LRMs may not be quite as advanced as we had hoped.
The tests were performed on Claude 3.7, DeepSeek R1, and o3-mini, pitting the reasoning versions against the non-reasoning versions of the same models to see whether the reasoning nature of the algorithm made any difference. It turns out that it makes less of a difference than expected. Both types of models, reasoning and non-reasoning, struggle once tasks get more complex. In fact, on easier tasks the simpler LLMs actually did better than their supposedly more advanced counterparts. The only time LRMs showed any edge was on tasks that hit a middle ground: not too simple, not too hard.
While the tests were fairly exhaustive, they were also designed to trip up the algorithms in order to properly measure their accuracy, and that cannot be done just by asking simple questions. Testing logical reasoning requires logic puzzles, and one of the tests they threw at the models was the Tower of Hanoi, a puzzle known to trip up even some very smart individuals.
For those unfamiliar with it, it is an old puzzle invented in the 1800s with three pegs and a stack of discs, and the goal is to move all the discs from one peg to another while following strict rules: only one disc may be moved at a time, and a larger disc can never be placed on top of a smaller one. The puzzle was solved back in 1957 by traditional AI, but the new models had a lot of trouble with it.
While this puzzle is meant to trip people up, children have been known to solve it, and yet, according to Apple, Claude could not manage seven discs without dipping below 80% accuracy, and o3-mini did not fare much better.
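To get a sense of why this result is striking, the classic recursive solution to the Tower of Hanoi fits in a handful of lines. The sketch below is a simple Python illustration, not code from Apple’s study; it prints how many moves a seven-disc puzzle needs (an n-disc puzzle takes 2^n - 1 moves, so seven discs already require 127 correct steps in a row).

```python
def hanoi(n, source, target, spare, moves):
    """Recursively solve the Tower of Hanoi, recording each move as (from_peg, to_peg)."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # move the top n-1 discs out of the way
    moves.append((source, target))               # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)   # stack the n-1 discs back on top of it

moves = []
hanoi(7, "A", "C", "B", moves)
print(len(moves))  # 127, i.e. 2**7 - 1 moves, every one of which must be correct
```

A single wrong move anywhere in that sequence breaks the solution, which is what makes long, exact multi-step procedures like this a tough benchmark for models that tend to lose the thread as complexity grows.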
The problem seems to be that reasoning models are not able to scale up their logic the way humans do, and that once they consider the instructions to have been followed, whether or not that is actually the case, they stop engaging.
Arizona State University computer scientist Subbarao (Rao) Kambhampati explains that he has observed that people tend to over-anthropomorphize the reasoning traces of LLMs, calling them “thinking” when they perhaps do not deserve that name. Another of his recent papers showed that even when reasoning traces appear to be correct, the final answers sometimes are not.
As a result, while these models might be helpful for things like writing drafts, sketching out code, or brainstorming ideas, they are not ready to take over tasks that demand real problem-solving muscle.
