TY - GEN
T1 - The AI Imitation Game
T2 - 26th IEEE International Conference on Information Reuse and Integration and Data Science, IRI 2025
AU - Wen, Victor
AU - Peng, Zedong
AU - Chen, Yusi
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Large Language Models (LLMs) have shown significant capabilities in reasoning, decision-making, and natural language understanding. However, it is not clear how these abilities compare to human cognitive skills. This paper evaluates the cognitive performance of six state-of-the-art LLMs (ChatGPT-4o, Llama 3.1 405B, Claude 3.5 Sonnet, Gemini 2.0 Pro, DeepSeek R1, and DeepSeek V3) using the Self-Administered Gerocognitive Examination (SAGE). We explore how mimicry and Chain-of-Thought (CoT) prompting techniques affect their cognitive performance. Our results show that ChatGPT-4o performs best in reasoning, memory, and comprehension, while the other models frequently struggle with memory recall, real-time tasks, and visuospatial reasoning. Mimicry techniques improved some scores but sometimes introduced incorrect reasoning in weaker models. We also observed significant cognitive anomalies, including hallucinations, indicating limited reliability for critical applications. These results confirm that knowledge distillation occurs in current LLMs and that poor knowledge transfer can lead to errors and inconsistencies. Improved benchmarks and more effective knowledge distillation techniques are therefore needed to make LLMs more reliable.
KW - cognitive exam
KW - cognitive impairment
KW - knowledge distillation
KW - large language models
KW - mimicry
UR - https://www.scopus.com/pages/publications/105017842971
DO - 10.1109/IRI66576.2025.00022
M3 - Conference contribution
AN - SCOPUS:105017842971
T3 - Proceedings - 2025 IEEE International Conference on Information Reuse and Integration and Data Science, IRI 2025
SP - 79
EP - 84
BT - Proceedings - 2025 IEEE International Conference on Information Reuse and Integration and Data Science, IRI 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 6 August 2025 through 8 August 2025
ER -