Leading AI Chatbots Show Signs Of Cognitive Impairment In Dementia Tests, Study Finds
Almost all leading large language models (LLMs) show signs of mild cognitive impairment in tests commonly used to detect early dementia, according to research published in The BMJ.
In a Rush? Here are the Quick Facts!
- Chatbots struggled with visuospatial and executive tasks like clock drawing and trail making.
- All chatbots performed well on tasks involving naming, attention, and language.
- Researchers say chatbots’ cognitive limitations may impede their use in clinical settings.
The findings suggest that “older” chatbot versions, like older human patients, tend to perform worse on cognitive assessments, challenging assumptions that AI might soon replace human doctors.
Advances in artificial intelligence have sparked debates about its potential to outperform human physicians, particularly in diagnostic tasks. While previous studies have highlighted LLMs' medical proficiency, their vulnerability to human-like impairments such as cognitive decline had remained largely unexplored.
To address this, researchers tested the cognitive abilities of widely available chatbots—ChatGPT 4 and 4o (OpenAI), Claude 3.5 “Sonnet” (Anthropic), and Gemini 1 and 1.5 (Alphabet)—using the Montreal Cognitive Assessment (MoCA).
The MoCA is a diagnostic tool for detecting cognitive impairment and early dementia. It evaluates attention, memory, language, visuospatial skills, and executive functions through a series of short tasks.
Scores range from 0 to 30, with 26 or above generally considered normal. The chatbots were given the same instructions as human patients, and scoring was reviewed by a practicing neurologist.
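For readers curious how such an assessment might be administered to a chatbot programmatically, below is a minimal sketch in Python, assuming API access via the `openai` client library. The task prompt, score value, and helper functions are illustrative assumptions, not the study's actual protocol; in the study itself, responses were scored with review by a practicing neurologist rather than automatically.

```python
# Minimal sketch: posing a MoCA-style task to a chatbot and applying the
# standard 26/30 cutoff to a summed score. Assumes the `openai` package is
# installed and OPENAI_API_KEY is set; the prompt below is illustrative,
# not the MoCA's actual wording.
from openai import OpenAI

MOCA_CUTOFF = 26  # scores of 26 or above (out of 30) are generally considered normal

client = OpenAI()

def ask_task(prompt: str) -> str:
    """Send one MoCA-style task to the model and return its reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # one of the models evaluated in the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def classify(total_score: int) -> str:
    """Apply the conventional MoCA threshold to a summed score (0-30)."""
    return "normal" if total_score >= MOCA_CUTOFF else "possible impairment"

# A verbal-abstraction item in the spirit of the MoCA's similarity section.
print(ask_task("In what way are a train and a bicycle alike?"))
print(classify(18))  # hypothetical summed score, for illustration only
```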
Interestingly, the "age" of the models, defined by their release date, appeared to influence performance: older versions scored lower than their newer counterparts, mirroring patterns of cognitive decline seen in humans. Gemini 1.5, for example, outperformed Gemini 1.0 by six points despite being released less than a year later, suggesting rapid "cognitive decline" in the older version.
ChatGPT 4o excelled in attention tasks and succeeded in the Stroop test’s challenging incongruent stage, setting it apart from its peers. However, none of the LLMs completed visuospatial tasks successfully, and Gemini 1.5 notably produced a clock resembling an avocado—an error associated with dementia in human patients.
Despite these struggles, all models performed flawlessly in tasks requiring text-based analysis, such as the naming and similarity sections of the MoCA. This contrast underscores a key limitation: while LLMs handle linguistic abstraction well, they falter in integrating visual and executive functions, which require more complex cognitive processing.
The study acknowledges fundamental differences between the human brain and LLMs, but the uniform failure of all tested chatbots on tasks requiring visual abstraction and executive function points to weaknesses that could hinder their use in clinical settings.
“Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients—artificial intelligence models presenting with cognitive impairment,” the authors concluded.
These findings suggest that while LLMs excel in specific cognitive domains, their deficits in visuospatial and executive tasks raise concerns about their reliability in medical diagnostics and broader applications.