Performance of Open-Source Large Language Models in Psychiatry: Usability Study Through Comparative Analysis of Non-English Records and English Translations.

Authors

Kim Min-Gyu, Hwang Gyubeom, Chang Junhyuk, Chang Seheon, Roh Hyun Woong, Park Rae Woong

Affiliations

Department of Biomedical Informatics, Ajou University School of Medicine, 206 World cup-ro, Yeongtong-gu, Suwon, 16499, Republic of Korea, 82 312194471, 82 312194472.

Center for Biomedical Informatics Research, Ajou University Medical Center, Suwon, Republic of Korea.

Publication information

J Med Internet Res. 2025 Aug 18;27:e69857. doi: 10.2196/69857.

DOI:10.2196/69857
PMID:40825309
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12360790/
Abstract

BACKGROUND

Large language models (LLMs) have emerged as promising tools for addressing global disparities in mental health care. However, cloud-based proprietary models raise concerns about data privacy and limited adaptability to local health care systems. In contrast, open-source LLMs offer several advantages, including enhanced data security, the ability to operate offline in resource-limited settings, and greater adaptability to non-English clinical environments. Nevertheless, their performance in psychiatric applications involving non-English language inputs remains largely unexplored.

OBJECTIVE

This study aimed to systematically evaluate the clinical reasoning capabilities and diagnostic accuracy of a locally deployable open-source LLM in both Korean and English psychiatric contexts.

METHODS

The openbuddy-mistral-7b-v13.1 model, fine-tuned from Mistral 7B to enable conversational capabilities in Korean, was selected. A total of 200 deidentified psychiatric interview notes, documented during initial assessments of emergency department patients, were randomly selected from the electronic medical records of a tertiary hospital in South Korea. The dataset included 50 cases each of schizophrenia, bipolar disorder, depressive disorder, and anxiety disorder. The model translated the Korean notes into English and was prompted to extract 5 clinically meaningful diagnostic clues and generate the 2 most likely diagnoses using both the original Korean and translated English inputs. The hallucination rate and clinical relevance of the generated clues were manually evaluated. Top-1 and top-2 diagnostic accuracy were assessed by comparing the model's prediction with the ground truth labels. Additionally, the model's performance on a structured diagnostic task was evaluated using the psychiatry section of the Korean Medical Licensing Examination and its English-translated version.
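The top-1 and top-2 accuracy measures described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function name `top_k_accuracy`, the exact string matching, and the four toy cases are assumptions made here, since the abstract does not state how predictions were matched to the ground truth labels.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose true label is among the first k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, truths)
        if truth in ranked[:k]
    )
    return hits / len(truths)


# Hypothetical toy data (NOT from the study): two ranked diagnoses per case.
toy_predictions = [
    ["schizophrenia", "bipolar disorder"],
    ["depressive disorder", "anxiety disorder"],
    ["anxiety disorder", "depressive disorder"],
    ["bipolar disorder", "schizophrenia"],
]
toy_truths = [
    "schizophrenia",
    "anxiety disorder",
    "anxiety disorder",
    "depressive disorder",
]

top1 = top_k_accuracy(toy_predictions, toy_truths, k=1)
top2 = top_k_accuracy(toy_predictions, toy_truths, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

Top-2 accuracy can never be lower than top-1 accuracy on the same cases, because every top-1 hit is also a top-2 hit.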

RESULTS

The model generated 997 clues from Korean interview notes and 1003 clues from English-translated notes. Hallucinations were more frequent with Korean input (n=301, 30.2%) than with English (n=134, 13.4%). Diagnostic relevance was also higher in English (n=429, 42.8%) compared to Korean (n=341, 34.2%). The model showed significantly higher top-1 diagnostic accuracy with English input (74.5% vs 59%; P<.001), while top-2 accuracy was comparable (89.5% vs 90%; P=.56). Across 115 questions from the medical licensing examination, the model performed better in English (n=53, 46.1%) than in Korean (n=37, 32.2%), with superior results in 7 of 11 diagnostic categories.
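The percentages above follow directly from the reported counts, and the sketch below recomputes them. Only the counts and denominators come from the abstract (997 and 1003 clues, 115 examination questions, 200 notes); the dictionary keys are labels chosen here, and the implied top-1 case counts assume that every one of the 200 notes received a prediction in each language, which the abstract does not state explicitly.

```python
# (numerator, denominator) pairs as reported in the abstract.
reported_counts = {
    "hallucinated clues, Korean input": (301, 997),
    "hallucinated clues, English input": (134, 1003),
    "diagnostically relevant clues, Korean input": (341, 997),
    "diagnostically relevant clues, English input": (429, 1003),
    "exam questions correct, Korean": (37, 115),
    "exam questions correct, English": (53, 115),
}

# Recompute each percentage to one decimal place.
recomputed = {
    label: round(100 * numerator / denominator, 1)
    for label, (numerator, denominator) in reported_counts.items()
}

for label, percent in recomputed.items():
    print(f"{label}: {percent}%")

# Assumption: all 200 notes were scored in both languages, so the reported
# top-1 accuracies imply these numbers of correctly diagnosed cases.
TOTAL_NOTES = 200
implied_top1_correct = {
    "English": round(74.5 * TOTAL_NOTES / 100),
    "Korean": round(59 * TOTAL_NOTES / 100),
}
print(implied_top1_correct)
```

Note that the two clue totals differ (997 vs 1003) because the model did not always return exactly 5 clues per note, so each rate uses its own denominator.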

CONCLUSIONS

This study provides an in-depth evaluation of an open-source LLM in multilingual psychiatric settings. The model's performance varied notably by language, with English input consistently outperforming Korean. These findings highlight the importance of assessing LLMs in diverse linguistic and clinical contexts. To ensure equitable mental health artificial intelligence, further development of high-quality psychiatric datasets in underrepresented languages and culturally adapted training strategies will be essential.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fe07/12360790/0a64f30e97ef/jmir-v27-e69857-g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fe07/12360790/ebc66ac932e3/jmir-v27-e69857-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fe07/12360790/59f954154950/jmir-v27-e69857-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/fe07/12360790/d5f782bb950c/jmir-v27-e69857-g003.jpg