Are clinical improvements in large language models a reality? Longitudinal comparisons of ChatGPT models and DeepSeek-R1 for psychiatric assessments and interventions.

Authors

Smith Alexander, Liebrenz Michael, Bhugra Dinesh, Grana Juan, Schleifer Roman, Buadze Ana

Affiliations

Department of Forensic Psychiatry, University of Bern, Switzerland.

Institute of Psychiatry, Psychology and Neuroscience, King's College London, UK.

Publication Information

Int J Soc Psychiatry. 2025 Jul 31:207640251358071. doi: 10.1177/00207640251358071.

Abstract

BACKGROUND

Potential clinical applications for emerging large language models (LLMs; e.g. ChatGPT) are well-documented, and newer systems (e.g. DeepSeek) have attracted increasing attention. Yet important questions persist about their reliability and cultural responsiveness in psychiatric settings.

METHODS

This study explored the diagnostic accuracy, therapeutic appropriateness, and cultural sensitivity of ChatGPT-4o, ChatGPT-4.5, and DeepSeek-R1 (all March 2025 versions). It represents one of the first evaluations of DeepSeek-R1 in this context and one of the first longitudinal inquiries into LLMs in psychiatry. Three psychiatric cases from earlier literature on sleep-related problems and co-occurring issues were utilised, allowing cross-comparisons with a 2023 ChatGPT version, alongside culturally specific vignette adaptations. Overall, outputs for six scenarios were derived and subsequently reviewed qualitatively by four psychiatrists for their strengths and limitations.

RESULTS

ChatGPT-4o, ChatGPT-4.5, and DeepSeek-R1 showed modest improvements over the 2023 ChatGPT model but still exhibited significant limitations. Communication was empathetic, and non-pharmacological advice typically adhered to evidence-based practices. Primary diagnoses were broadly accurate but often omitted somatic factors and comorbidities. Nevertheless, consistent with past findings, clinical reasoning worsened as case complexity increased; this was especially apparent for suicidality safeguards and risk stratification. Pharmacological recommendations frequently diverged from established guidelines, whilst cultural adaptations remained largely superficial. Finally, output variance was noted in several cases, and the LLMs occasionally failed to clarify their inability to prescribe medication.

CONCLUSION

Despite incremental advancements, ChatGPT-4o, ChatGPT-4.5, and DeepSeek-R1 were affected by major shortcomings, particularly in risk evaluation, adherence to evidence-based practice, and cultural awareness. Presently, we conclude that these tools cannot substitute for mental health professionals but may confer adjunctive benefits. Notably, DeepSeek-R1 did not fall behind its counterparts, warranting further inquiries in jurisdictions permitting its use. Equally, greater emphasis on transparency and prompt engineering would be necessary for safe and equitable LLM deployment in psychiatry.
