Performance of Open-Source Large Language Models in Psychiatry: Usability Study Through Comparative Analysis of Non-English Records and English Translations.

Author information

Kim Min-Gyu, Hwang Gyubeom, Chang Junhyuk, Chang Seheon, Roh Hyun Woong, Park Rae Woong

Affiliations

Department of Biomedical Informatics, Ajou University School of Medicine, 206 World cup-ro, Yeongtong-gu, Suwon, 16499, Republic of Korea, 82 312194471, 82 312194472.

Center for Biomedical Informatics Research, Ajou University Medical Center, Suwon, Republic of Korea.

Publication information

J Med Internet Res. 2025 Aug 18;27:e69857. doi: 10.2196/69857.

DOI:10.2196/69857

PMID:40825309

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12360790/

Abstract

BACKGROUND

Large language models (LLMs) have emerged as promising tools for addressing global disparities in mental health care. However, cloud-based proprietary models raise concerns about data privacy and limited adaptability to local health care systems. In contrast, open-source LLMs offer several advantages, including enhanced data security, the ability to operate offline in resource-limited settings, and greater adaptability to non-English clinical environments. Nevertheless, their performance in psychiatric applications involving non-English language inputs remains largely unexplored.

OBJECTIVE

This study aimed to systematically evaluate the clinical reasoning capabilities and diagnostic accuracy of a locally deployable open-source LLM in both Korean and English psychiatric contexts.

METHODS

The openbuddy-mistral-7b-v13.1 model, fine-tuned from Mistral 7B to enable conversational capabilities in Korean, was selected. A total of 200 deidentified psychiatric interview notes, documented during initial assessments of emergency department patients, were randomly selected from the electronic medical records of a tertiary hospital in South Korea. The dataset included 50 cases each of schizophrenia, bipolar disorder, depressive disorder, and anxiety disorder. The model translated the Korean notes into English and was prompted to extract 5 clinically meaningful diagnostic clues and generate the 2 most likely diagnoses using both the original Korean and translated English inputs. The hallucination rate and clinical relevance of the generated clues were manually evaluated. Top-1 and top-2 diagnostic accuracy were assessed by comparing the model's prediction with the ground truth labels. Additionally, the model's performance on a structured diagnostic task was evaluated using the psychiatry section of the Korean Medical Licensing Examination and its English-translated version.
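The top-1 and top-2 accuracy metrics described above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `top_k_accuracy` and the toy predictions and labels are hypothetical, chosen only to show how a ranked list of 2 candidate diagnoses is scored against a single ground truth label.

```python
from typing import Sequence


def top_k_accuracy(
    predictions: Sequence[Sequence[str]],
    labels: Sequence[str],
    k: int,
) -> float:
    """Return the fraction of cases whose true label is among the first k predictions."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one case is required")
    # A case counts as a hit when the true label appears in the top-k slice.
    hits = sum(label in ranked[:k] for ranked, label in zip(predictions, labels))
    return hits / len(labels)


# Hypothetical toy data (NOT from the study): each inner list is the model's
# ranked pair of most likely diagnoses for one interview note.
toy_predictions = [
    ["schizophrenia", "bipolar disorder"],
    ["depressive disorder", "anxiety disorder"],
    ["bipolar disorder", "depressive disorder"],
    ["anxiety disorder", "depressive disorder"],
]
toy_labels = [
    "schizophrenia",
    "anxiety disorder",
    "depressive disorder",
    "anxiety disorder",
]

top1 = top_k_accuracy(toy_predictions, toy_labels, k=1)
top2 = top_k_accuracy(toy_predictions, toy_labels, k=2)
print(f"top-1 = {top1:.2f}, top-2 = {top2:.2f}")
```

In this toy example the first-ranked diagnosis is correct in 2 of 4 cases, while the true label appears among the 2 candidates in all 4, which mirrors how top-2 accuracy can remain high even when top-1 accuracy differs between inputs.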

RESULTS

The model generated 997 clues from Korean interview notes and 1003 clues from English-translated notes. Hallucinations were more frequent with Korean input (n=301, 30.2%) than with English input (n=134, 13.4%). Diagnostic relevance was also higher with English input (n=429, 42.8%) than with Korean input (n=341, 34.2%). The model showed significantly higher top-1 diagnostic accuracy with English input (74.5% vs 59%; P<.001), while top-2 accuracy was comparable (89.5% vs 90%; P=.56). Across 115 questions from the medical licensing examination, the model performed better in English (n=53, 46.1%) than in Korean (n=37, 32.2%), with superior results in 7 of 11 diagnostic categories.
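The reported percentages can be recomputed from the counts given above. The sketch below is only an arithmetic check, with one labeled assumption: each rate is taken to use the number of clues generated for that language (997 or 1003), or the 115 examination questions, as its denominator. The dictionary keys and the helper `percent` are illustrative names.

```python
def percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# (count, denominator) pairs taken from the Results paragraph.
# Assumption: denominators are the per-language clue totals or the 115 exam questions.
reported = {
    "hallucination_korean": (301, 997),
    "hallucination_english": (134, 1003),
    "relevance_english": (429, 1003),
    "relevance_korean": (341, 997),
    "exam_english": (53, 115),
    "exam_korean": (37, 115),
}

rates = {name: percent(count, total) for name, (count, total) in reported.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

Under that assumption, each recomputed value matches the percentage stated in the abstract.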

CONCLUSIONS

This study provides an in-depth evaluation of an open-source LLM in multilingual psychiatric settings. The model's performance varied notably by language, with English input consistently outperforming Korean. These findings highlight the importance of assessing LLMs in diverse linguistic and clinical contexts. To ensure equitable mental health artificial intelligence, further development of high-quality psychiatric datasets in underrepresented languages and culturally adapted training strategies will be essential.
