Levkovich Inbar
Faculty of Education, Tel-Hai Academic College, Upper Galilee 2208, Israel.
Eur J Investig Health Psychol Educ. 2025 Jan 18;15(1):9. doi: 10.3390/ejihpe15010009.
Large language models (LLMs) offer promising possibilities in mental health, yet their ability to assess disorders and recommend treatments remains underexplored. This quantitative cross-sectional study evaluated four LLMs (Gemini (Gemini 2.0 Flash Experimental), Claude (Claude 3.5 Sonnet), ChatGPT-3.5, and ChatGPT-4) using text vignettes representing conditions such as depression, suicidal ideation, early and chronic schizophrenia, social phobia, and PTSD. Each model's diagnostic accuracy, treatment recommendations, and predicted outcomes were compared with norms established by mental health professionals. Findings indicated that for certain conditions, including depression and PTSD, models such as ChatGPT-4 achieved higher diagnostic accuracy than human professionals. In more complex cases, however, such as early schizophrenia, LLM performance varied: ChatGPT-4 achieved only 55% accuracy, while other LLMs and professionals performed better. LLMs tended to suggest a broader range of proactive treatments, whereas professionals recommended more targeted psychiatric consultations and specific medications. In terms of outcome predictions, professionals were generally more optimistic regarding full recovery, especially with treatment, while LLMs predicted lower full recovery rates and higher partial recovery rates, particularly in untreated cases. Although LLMs recommended a broader treatment range, their conservative recovery predictions, particularly for complex conditions, highlight the need for professional oversight. LLMs provide valuable support in diagnostics and treatment planning but cannot replace professional discretion.