Xu Yijun, Fang Zhaoxi, Lin Weinan, Jiang Yue, Jin Wen, Balaji Prasanalakshmi, Wang Jiangda, Xia Ting
Department of Computer Science and Engineering, Shaoxing University, Shaoxing, China.
Institute of Artificial Intelligence, Shaoxing University, Shaoxing, China.
Front Psychiatry. 2025 Aug 6;16:1646974. doi: 10.3389/fpsyt.2025.1646974. eCollection 2025.
Large language models (LLMs) have opened up new possibilities in mental health, with applications in mental health assessment, psychological counseling, and education. This study systematically evaluates 15 state-of-the-art LLMs, including DeepSeek-R1/V3 (March 24, 2025), GPT-4.1 (April 15, 2025), Llama 4 (April 5, 2025), and QwQ (March 6, 2025, developed by Alibaba), on two key tasks in the Chinese context: mental health knowledge testing and mental illness diagnosis. We use publicly available datasets, including Dreaddit, SDCNL, and questions from the CAS Counsellor Qualification Exam. Results indicate that DeepSeek-R1, QwQ, and GPT-4.1 outperform the other models in both knowledge accuracy and diagnostic performance. Our findings highlight the strengths and limitations of current LLMs in Chinese mental health scenarios and provide guidance for selecting and improving models in this sensitive domain.