Performance of Large Language Models in the Non-English Context: Qualitative Study of Models Trained on Different Languages in Chinese Medical Examinations.

Authors

Yao Zhong, Duan Liantan, Xu Shuo, Chi Lingyi, Sheng Dongfang

Affiliations

Department of Neurosurgery, Qilu Hospital of Shandong University, Jinan, China.

Cheeloo College of Medicine, Shandong University, Jinan, China.

Publication information

JMIR Med Inform. 2025 Jun 27;13:e69485. doi: 10.2196/69485.

DOI: 10.2196/69485
PMID: 40577654
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12227152/

Abstract

BACKGROUND

Research on large language models (LLMs) in the medical field has predominantly focused on models trained on English-language corpora and has evaluated them in English-speaking contexts. How models trained on non-English corpora perform, and how LLMs in general perform in non-English contexts, remains underexplored.

OBJECTIVE

This study aimed to evaluate the performance of LLMs trained on corpora in different languages, using the Chinese National Medical Licensing Examination (CNMLE) and a set of constructed analogous questions as benchmarks.

METHODS

Under different prompt settings, we sequentially posed questions to 7 LLMs: 2 primarily trained on English-language corpora and 5 primarily on Chinese-language corpora. The models' responses were compared against standard answers to calculate the accuracy rate of each model. Further subgroup analyses were conducted by categorizing the questions based on various criteria. We also collected error sets to explore patterns of mistakes across different models.
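
The scoring described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dictionary layout (question ID mapped to an answer letter), and the category labels are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy(responses, answer_key):
    """Fraction of questions whose response matches the standard answer."""
    correct = sum(1 for qid, ans in answer_key.items() if responses.get(qid) == ans)
    return correct / len(answer_key)


def subgroup_accuracy(responses, answer_key, categories):
    """Accuracy per question category (hypothetical labels, e.g. 'simple')."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for qid, ans in answer_key.items():
        group = categories[qid]
        totals[group] += 1
        hits[group] += int(responses.get(qid) == ans)
    return {group: hits[group] / totals[group] for group in totals}


def error_set(responses, answer_key):
    """IDs of the questions a model answered incorrectly or left unanswered."""
    return {qid for qid, ans in answer_key.items() if responses.get(qid) != ans}
```

Error sets from two models can then be compared with ordinary set operations, for example `error_set(a, key) & error_set(b, key)` for the questions both models missed.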

RESULTS

Under the zero-shot setting, 6 of the 7 models exceeded the passing level, with the highest accuracy rate achieved by the Chinese LLM Baichuan (86.67%), followed by ChatGPT (83.83%). On the constructed questions, all 7 models exceeded the passing threshold, with Baichuan maintaining the highest accuracy rate (87.00%). In few-shot learning, all models exceeded the passing threshold, and Baichuan, ChatGLM, and ChatGPT retained the highest accuracy. While Llama showed marked improvement over the previous tests, the relative performance rankings of the other models stayed similar to the previous results. In subgroup analyses, English models demonstrated comparable or superior performance to Chinese models on questions related to ethics and policy. All models except Llama generally had higher accuracy rates on simple questions than on complex ones. The error set of ChatGPT was similar to those of the Chinese models. Multimodel cross-verification outperformed any single model, particularly improving the accuracy rate on simple questions. Dual-model and tri-model verification achieved accuracy rates of 94.17% and 96.33%, respectively.
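
The abstract does not specify how answers from several models were combined. As one possible reading, the sketch below uses a plurality vote, with ties resolved in favor of the model listed first. The voting rule, the tie-break, and the function name `cross_verify` are assumptions for illustration and may differ from the procedure used in the study.

```python
from collections import Counter


def cross_verify(answers):
    """Combine per-model answers for one question.

    answers: list of answer letters, ordered by model priority (assumed).
    Returns (chosen_answer, unanimous).
    """
    counts = Counter(answers)
    top = max(counts.values())
    # Among the answers with the most votes, keep the highest-priority one.
    chosen = next(a for a in answers if counts[a] == top)
    return chosen, len(counts) == 1
```

The `unanimous` flag separates questions on which the models agree from those on which they disagree, so the disagreements can be routed to manual review instead of being decided by the tie-break.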

CONCLUSIONS

At the current level, LLMs trained primarily on English corpora and those trained mainly on Chinese corpora perform similarly well on the CNMLE, although the Chinese models still hold an edge. The performance differences between ChatGPT and the Chinese LLMs are not solely due to communication barriers but are more likely influenced by disparities in the training data. Cross-verification with multiple LLMs can achieve excellent performance in medical examinations.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebae/12227152/942a77762dc2/medinform-v13-e69485-g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebae/12227152/d2fd7e50d9c7/medinform-v13-e69485-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ebae/12227152/8e1a59ce9e30/medinform-v13-e69485-g002.jpg
