A recent study from Apple highlights the limitations of AI models in abstract reasoning, showing that they fail to solve straightforward arithmetic problems when those problems are framed as short narratives.
In a significant revelation about the current limitations of artificial intelligence, researchers at Apple have found that many state-of-the-art AI models struggle to solve simple arithmetic problems when they are presented as word problems. The research underscores the gap between human and artificial reasoning, especially where abstract thinking is required.
Oliver’s kiwi conundrum, a straightforward arithmetic problem, served as a litmus test for AI capabilities. In the problem, Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday’s amount, 88 kiwis, on Sunday. Five of the kiwis are described as smaller than average, but their size is irrelevant to the count. The correct answer, 44 + 58 + 88 = 190 kiwis, was consistently missed by the AI models, many of which wrongly treated the kiwis’ size as a factor, reflecting a fundamental inability to set aside irrelevant details.
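To make the arithmetic concrete, here is a minimal Python sketch, not taken from the study itself, that computes the expected total and builds two prompt variants, with and without the irrelevant clause; the prompt wording and variable names are illustrative assumptions.

```python
friday = 44
saturday = 58
sunday = 2 * friday          # "double the number he picked on Friday" -> 88

total = friday + saturday + sunday
assert total == 190          # the five smaller kiwis do not change the count

base_prompt = (
    f"Oliver picks {friday} kiwis on Friday, {saturday} on Saturday, and "
    f"double Friday's amount on Sunday. How many kiwis does Oliver have?"
)
# Append an irrelevant clause, in the spirit of the study's perturbations.
distractor_prompt = base_prompt.replace(
    "How many",
    "Five of them were a bit smaller than average. How many",
)

print(total)              # 190, with or without the distractor
print(distractor_prompt)
```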
This study was spearheaded by a team from Apple, a leading consumer tech company that has recently been integrating AI features into its products. The results have drawn attention not only for their implications but also for the high profile of the company involved. AI critic Gary Marcus noted that while the findings received widespread attention, they should not come as a surprise, since the limitations of AI in abstract reasoning have been documented in various other studies.
The AI models tested, including those from companies such as OpenAI, Meta, and Google, were noted to suffer “catastrophic performance drops” when extraneous details were introduced into mathematical problems. For instance, another problem involving school supplies was misunderstood once inflation was mentioned, even though the inflation rate was irrelevant to the question being asked.
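As a sketch of how such a drop might be quantified, the Python below compares accuracy on a base problem set against a perturbed set with irrelevant details added; this mirrors the study’s setup in spirit only, and the `ask_model` callable is a hypothetical stand-in for whatever model is under test.

```python
from typing import Callable, Iterable, Tuple

def accuracy(ask_model: Callable[[str], str],
             problems: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prompt, expected answer) pairs answered correctly."""
    problems = list(problems)
    hits = sum(ask_model(prompt).strip() == answer
               for prompt, answer in problems)
    return hits / len(problems)

def relative_drop(ask_model: Callable[[str], str],
                  base: Iterable[Tuple[str, str]],
                  perturbed: Iterable[Tuple[str, str]]) -> float:
    """How much accuracy falls once irrelevant details are added."""
    base_acc = accuracy(ask_model, base)
    return (base_acc - accuracy(ask_model, perturbed)) / base_acc
```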
The overarching issue, as explained by lead author Mehrdad Farajtabar, lies in how these AI models, particularly large language models (LLMs), work at their core: they match patterns rather than apply mathematical principles. The result is a form of reasoning that owes more to pattern recognition than to genuine cognitive processing.
The Apple research aligns with previous assessments of AI’s capabilities and its limitations on reasoning tasks that even young children grasp intuitively. Melanie Mitchell of the Santa Fe Institute describes this as a gap in basic abstract reasoning, highlighting the difference between AI’s pattern-driven responses and the human ability to apply logic adaptively and to generalize from examples.
Such studies point to broader implications for AI deployment in critical fields. In applications where precision is paramount, such as healthcare, the potential for AI to introduce errors, not through gaps in knowledge but through “hallucinations” or fabricated reasoning, is concerning. Other AI tools have exemplified the problem, such as OpenAI’s Whisper, which has shown similar reliability issues, particularly by adding fabricated passages to transcriptions.
With the technology advancing rapidly, companies continue to assert improvements and to promote AI systems for sensitive tasks; Elon Musk, for example, has pitched his chatbot Grok as a tool for medical analysis. However, experts like Marcus caution that errors are inherent to current AI systems, so human oversight remains necessary to avoid potentially severe real-world consequences. The Apple study therefore serves as an important reminder that while AI has considerable potential, its reasoning ability has intrinsic limitations that must, for now, be mitigated by human judgment and review.
Source: Noah Wire Services