
Chatbots Fail Standard Cognitive Test

— Large language models show susceptibility to cognitive impairment

A photo of a man about to type a prompt into an artificial intelligence chat window on a laptop.

President-elect Donald Trump may have aced the Montreal Cognitive Assessment (MoCA), but artificial intelligence (AI) chatbots didn't perform nearly as well.

On the well-known cognitive screen, most chatbots -- also known as large language models (LLMs) -- showed signs of mild cognitive impairment.

ChatGPT 4 and Claude 3.5 each scored 25 points, while Gemini 1.0 scored 16 points, reported Roy Dayan, MD, of Hadassah Hebrew University Medical Center in Jerusalem, and co-authors.

Only ChatGPT 4o achieved a score indicating normal cognition (26 points), the researchers said in The BMJ's Christmas issue, an annual collection of light-hearted feature articles and original, peer-reviewed research.

"Colossal advancements in the field of artificial intelligence have led to a flurry of excited and fearful speculation as to whether chatbots surpass human physicians," Dayan and colleagues noted.

Although chatbots err and create fake references, they have proven to be adept at a range of medical diagnostic tasks and test-taking, outscoring human physicians on various exams, including the neurology boards.

While beating physicians in various tests, "LLMs face difficulties with a standard cognitive exam," Dayan said.

"Specifically, they all have impairment in higher visual functions and in spatial orientation. These findings were related to the age of the LLMs, with older chatbots frequently having more difficulties," he told MedPage Today.

One explanation is that unlike the human brain, LLMs lack the ability to perform complicated visual abstractions, since they need to translate visual inputs to verbal ones. "This is in contrast to the human brain, which developed skills of visual abstraction long before verbal language was created," Dayan said.

"We must stress that following our study, LLMs might learn how to 'trick' the MoCA test and produce correct answers copied from human exam-takers," he pointed out.

"However, this does not mean they understood the test. It's similar to the 'Chinese Room' argument," a of philosopher John Searle which holds that a computer executing a program does not have a mind or consciousness.

Dayan and colleagues administered the MoCA (version 8.1) to several publicly available LLMs: ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 Sonnet (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet).

The MoCA is widely used to detect cognitive dysfunction and early signs of dementia; it assesses attention, memory, language, visuospatial skills, and executive function. MoCA scores of 26-30 are generally considered normal cognition, and scores of 25 or less indicate cognitive impairment.
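As a rough illustration only (not part of the study), those cutoffs can be expressed as a small Python sketch; the function name is our own, and the printed scores simply restate the numbers reported in the article:

def interpret_moca(score: int) -> str:
    # MoCA totals run 0-30; the article cites 26-30 as normal cognition
    # and 25 or less as indicating cognitive impairment.
    if not 0 <= score <= 30:
        raise ValueError("MoCA scores range from 0 to 30")
    return "normal cognition" if score >= 26 else "cognitive impairment"

# Chatbot scores as reported in the article:
for model, score in [("ChatGPT 4o", 26), ("ChatGPT 4", 25),
                     ("Claude 3.5 Sonnet", 25), ("Gemini 1.0", 16)]:
    print(f"{model}: {score} -> {interpret_moca(score)}")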

The researchers gave the LLMs the same instructions as those given to human patients. Scoring followed official guidelines and was evaluated by a practicing neurologist.

All chatbots showed poor performance in visuospatial skills and executive tasks, including the trail-making task and the clock-drawing test. Gemini 1.5 produced a small, avocado-shaped clock, which recent studies have shown to be associated with dementia, the researchers noted.

Gemini models also failed at the delayed recall task, which required test-takers to remember a five-word sequence.

Most other tasks, including naming, attention, language, and abstraction, were performed well by all chatbots, Dayan and colleagues said.

In further testing, LLMs could not show empathy or accurately interpret complex visual scenes. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time during a task.

These findings are observational, and the researchers acknowledged there are essential differences between the human brain and chatbots.

The uniform failure of all LLMs in tasks requiring visual abstraction and executive function highlights a significant area of weakness that could hinder their use in clinical settings, they added.

"While our study was created with humor, we believe it has serious implications in the current discourse regarding the role of AI in medicine: the initial part of every physical examination is the general impression you get while talking to the patient, which requires many visual abstraction skills," Dayan said.

  • Judy George covers neurology and neuroscience news for MedPage Today, writing about brain aging, Alzheimer’s, dementia, MS, rare diseases, epilepsy, autism, headache, stroke, Parkinson’s, ALS, concussion, CTE, sleep, pain, and more.

Disclosures

Dayan and co-authors reported no disclosures.

Primary Source

The BMJ

Dayan R, et al "Age against the machine -- susceptibility of large language models to cognitive impairment: Cross sectional analysis" BMJ 2024; DOI: 10.1136/bmj-2024-081948.