Benchmarking four large language models' performance of addressing Chinese patients' inquiries about dry eye disease: A two-phase study.

Author information

Shi Runhan, Liu Steven, Xu Xinwei, Ye Zhengqiang, Yang Jin, Le Qihua, Qiu Jini, Tian Lijia, Wei Anji, Shan Kun, Zhao Chen, Sun Xinghuai, Zhou Xingtao, Hong Jiaxu

Affiliations

Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China.

NHC Key laboratory of molecular engineering of polymers, Fudan University, Shanghai, 200031, China.

Publication information

Heliyon. 2024 Jul 14;10(14):e34391. doi: 10.1016/j.heliyon.2024.e34391. eCollection 2024 Jul 30.

DOI:10.1016/j.heliyon.2024.e34391

PMID:39113991

Original article link: https://pmc.ncbi.nlm.nih.gov/articles/PMC11305187/

Abstract

PURPOSE

To evaluate the performance of four large language models (LLMs), namely GPT-4, PaLM 2, Qwen, and Baichuan 2, in generating responses to inquiries from Chinese patients about dry eye disease (DED).

DESIGN

Two-phase study, including a cross-sectional test in the first phase and a real-world clinical assessment in the second phase.

SUBJECTS

Eight board-certified ophthalmologists and 46 patients with DED.

METHODS

The chatbots' responses to Chinese patients' inquiries about DED were evaluated in two phases. In the first phase, six senior ophthalmologists subjectively rated the responses on a 5-point Likert scale across five domains: correctness, completeness, readability, helpfulness, and safety. Objective readability analysis was performed using a Chinese readability analysis platform. In the second phase, 46 representative patients with DED posed questions to the two language models that performed best in the first phase (GPT-4 and Baichuan 2) and then rated the answers for satisfaction and readability. Two senior ophthalmologists then assessed these responses across the same five domains.
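As a rough illustration of how per-domain Likert ratings from several raters might be averaged, the following Python sketch may help. It is not the authors' analysis code: the function name `mean_domain_scores`, the data layout, and the example ratings are all hypothetical placeholders, and no study data are reproduced.

```python
from statistics import mean

# The five rating domains named in the Methods.
DOMAINS = ("correctness", "completeness", "readability", "helpfulness", "safety")


def mean_domain_scores(ratings):
    """Average 5-point Likert ratings per domain.

    `ratings` is a list with one dict per rater, mapping each domain
    name to an integer score from 1 to 5 (hypothetical layout).
    """
    for rater in ratings:
        for domain in DOMAINS:
            if not 1 <= rater[domain] <= 5:
                raise ValueError(f"{domain} score out of range: {rater[domain]}")
    return {domain: mean(r[domain] for r in ratings) for domain in DOMAINS}


# Placeholder ratings from three imaginary raters (NOT data from the study).
example_ratings = [
    {"correctness": 5, "completeness": 4, "readability": 4, "helpfulness": 5, "safety": 5},
    {"correctness": 4, "completeness": 4, "readability": 5, "helpfulness": 4, "safety": 5},
    {"correctness": 5, "completeness": 3, "readability": 4, "helpfulness": 4, "safety": 4},
]

scores = mean_domain_scores(example_ratings)
for domain, score in scores.items():
    print(f"{domain}: {score:.2f}")
```

A simple mean is only one possible summary; the abstract does not state which statistical tests produced the reported P values, so none are sketched here.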

MAIN OUTCOME MEASURES

Subjective scores for the five domains and objective readability scores in the first phase. Patient satisfaction, readability scores, and subjective scores for the five domains in the second phase.

RESULTS

In the first phase, GPT-4 exhibited superior performance across the five domains (correctness: 4.47; completeness: 4.39; readability: 4.47; helpfulness: 4.49; safety: 4.47, P < 0.05). However, the readability analysis revealed that GPT-4's responses were highly complex, with an average score of 12.86 (P < 0.05) compared to scores of 10.87, 11.53, and 11.26 for Qwen, Baichuan 2, and PaLM 2, respectively. In the second phase, as shown by the scores for the five domains, both GPT-4 and Baichuan 2 were adept in answering questions posed by patients with DED. However, the completeness of Baichuan 2's responses was relatively poor (4.04 vs. 4.48 for GPT-4, P < 0.05). Nevertheless, Baichuan 2's recommendations were more comprehensible than those of GPT-4 (patient readability: 3.91 vs. 4.61, P < 0.05; ophthalmologist readability: 2.67 vs. 4.33).

CONCLUSIONS

The findings underscore the potential of LLMs, particularly that of GPT-4 and Baichuan 2, in delivering accurate and comprehensive responses to questions from Chinese patients about DED.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4b57/11305187/062669bbb39b/gr1.jpg
