Hua Yining, Na Hongbin, Li Zehan, Liu Fenglin, Fang Xiao, Clifton David, Torous John
Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
Department of Psychiatry, Beth Israel Deaconess Medical Center, Boston, MA, USA.
NPJ Digit Med. 2025 Apr 30;8(1):230. doi: 10.1038/s41746-025-01611-4.
Large language models (LLMs) show promise in mental health care for handling human-like conversations, but their effectiveness remains uncertain. This scoping review synthesizes existing research on LLM applications in mental health care, reviews model performance and clinical effectiveness, identifies gaps in current evaluation methods using a structured evaluation framework, and provides recommendations for future development. A systematic search identified 726 unique articles, of which 16 met the inclusion criteria. These studies, encompassing applications such as clinical assistance, counseling, therapy, and emotional support, show initial promise. However, evaluation methods were often non-standardized, with most studies relying on ad hoc scales that limit comparability and robustness. A reliance on prompt tuning of proprietary models, such as OpenAI's GPT series, also raises concerns about transparency and reproducibility. As current evidence does not fully support their use as standalone interventions, more rigorous development and evaluation guidelines are needed for safe, effective clinical integration.