Comparative performance analysis of global and Chinese-domain large language models for myopia.

Authors

Jiang Zehua, Xu Yueyuan, Lim Zhi Wei, Wang Ziyao, Han Yingxiang, Yew Samantha Min Er, Pan Zhe, Wang Qian, Wu Gangyue, Wong Tien Yin, Wang Xiaofei, Wang Yaxing, Tham Yih Chung

Affiliations

Beijing Visual Science and Translational Eye Research Institute (BERI), Beijing Tsinghua Changgung Hospital Eye Center, School of Clinical Medicine, Tsinghua Medicine, Tsinghua University, Beijing, China.

Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore.

Publication information

Eye (Lond). 2025 Apr 13. doi: 10.1038/s41433-025-03775-5.

DOI: 10.1038/s41433-025-03775-5
PMID: 40223113

Abstract

BACKGROUND

The performance of global large language models (LLMs), trained largely on Western data, for diseases in other settings and languages is unknown. Taking myopia as an example, we compared global and Chinese-domain LLMs in addressing Chinese-specific myopia-related questions.

METHODS

Global LLMs (ChatGPT-3.5, ChatGPT-4.0, Google Bard, Llama-2 7B Chat) and Chinese-domain LLMs (Huatuo-GPT, MedGPT, Ali Tongyi Qianwen, Baidu ERNIE Bot, and Baidu ERNIE 4.0) were included. All LLMs were prompted to address 39 Chinese-specific myopia queries across 10 domains. Three myopia experts evaluated the accuracy of responses using a 3-point scale. Responses rated "Good" were further evaluated for comprehensiveness and empathy using a 5-point scale. Responses rated "Poor" were prompted again for self-correction and then re-analysed.
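
The grading workflow described above can be sketched in code. This is only an illustration under stated assumptions, not the authors' procedure: the abstract does not say how the three experts' ratings were combined, so the sketch assumes a per-response total that sums the three 3-point ratings (maximum 9, which would be consistent with reported means near 8.7) and a majority-vote consensus label that falls back to the lowest rating when all three experts disagree. The middle label "Borderline", the numeric coding, and all function names are hypothetical.

```python
from collections import Counter

# Assumed numeric coding of the 3-point accuracy scale.
# The abstract names only "Good" and "Poor"; "Borderline" is a hypothetical middle label.
SCALE = {"Poor": 1, "Borderline": 2, "Good": 3}


def total_score(ratings):
    """Sum the three experts' ratings (assumed aggregation, range 3 to 9)."""
    return sum(SCALE[r] for r in ratings)


def consensus(ratings):
    """Majority label; if all experts differ, fall back to the lowest rating (assumption)."""
    label, count = Counter(ratings).most_common(1)[0]
    if count >= 2:
        return label
    return min(ratings, key=SCALE.__getitem__)


def next_step(label):
    """Route a response to the follow-up stage described in the Methods."""
    if label == "Good":
        return "rate comprehensiveness and empathy (5-point scale)"
    if label == "Poor":
        return "prompt for self-correction and re-grade"
    return "no follow-up"


if __name__ == "__main__":
    example = ["Good", "Good", "Borderline"]
    label = consensus(example)
    print(total_score(example), label, next_step(label))
```

Summing the ratings keeps each expert's judgement visible in the score, while the consensus label decides which follow-up stage a response enters. Other aggregation rules, such as averaging, would fit the same skeleton.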

RESULTS

The top three LLMs for accuracy were ChatGPT-3.5 (8.72 ± 0.75), Baidu ERNIE 4.0 (8.62 ± 0.62), and ChatGPT-4.0 (8.59 ± 0.93), which also had the highest proportion of "Good" responses (94.8%). The top five LLMs for comprehensiveness were ChatGPT-3.5 (4.58 ± 0.42), ChatGPT-4.0 (4.56 ± 0.50), Baidu ERNIE 4.0 (4.44 ± 0.49), MedGPT (4.34 ± 0.59), and Baidu ERNIE Bot (4.22 ± 0.74) (all p ≥ 0.059, versus ChatGPT-3.5). The top five for empathy were ChatGPT-3.5 (4.75 ± 0.25), ChatGPT-4.0 (4.68 ± 0.32), MedGPT (4.50 ± 0.47), Baidu ERNIE Bot (4.42 ± 0.46), and Baidu ERNIE 4.0 (4.34 ± 0.64) (all p ≥ 0.052, versus ChatGPT-3.5). Baidu ERNIE 4.0 received no "Poor" ratings, while the other LLMs demonstrated self-correction capabilities, with improvements ranging from 50% to 100%.

CONCLUSIONS

Both global and Chinese-domain LLMs performed effectively in addressing Chinese-specific myopia-related queries. Global LLMs performed optimally in Chinese-language settings despite being trained primarily on non-Chinese data and in English.
