
Benchmarking four large language models' performance of addressing Chinese patients' inquiries about dry eye disease: A two-phase study.

Authors

Shi Runhan, Liu Steven, Xu Xinwei, Ye Zhengqiang, Yang Jin, Le Qihua, Qiu Jini, Tian Lijia, Wei Anji, Shan Kun, Zhao Chen, Sun Xinghuai, Zhou Xingtao, Hong Jiaxu

Affiliations

Department of Ophthalmology and Vision Science, State Key Laboratory of Molecular Engineering of Polymers, Fudan University, Shanghai, 200031, China.

NHC Key laboratory of molecular engineering of polymers, Fudan University, Shanghai, 200031, China.

Publication information

Heliyon. 2024 Jul 14;10(14):e34391. doi: 10.1016/j.heliyon.2024.e34391. eCollection 2024 Jul 30.

DOI:10.1016/j.heliyon.2024.e34391
PMID:39113991
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11305187/
Abstract

PURPOSE

To evaluate the performance of four large language models (LLMs), namely GPT-4, PaLM 2, Qwen, and Baichuan 2, in generating responses to inquiries from Chinese patients about dry eye disease (DED).

DESIGN

Two-phase study, including a cross-sectional test in the first phase and a real-world clinical assessment in the second phase.

SUBJECTS

Eight board-certified ophthalmologists and 46 patients with DED.

METHODS

The chatbots' responses to Chinese patients' inquiries about DED were evaluated in two phases. In the first phase, six senior ophthalmologists subjectively rated the chatbots' responses using a 5-point Likert scale across five domains: correctness, completeness, readability, helpfulness, and safety. Objective readability analysis was performed using a Chinese readability analysis platform. In the second phase, 46 representative patients with DED posed questions to the two language models that performed best in the first phase (GPT-4 and Baichuan 2) and then rated the answers for satisfaction and readability. Two senior ophthalmologists then assessed the responses across the five domains.
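As a rough illustration of the scoring described above, per-domain scores can be obtained by averaging each rater's 5-point Likert rating within a domain. The sketch below assumes simple arithmetic means rounded to two decimals; the function name `domain_means` and the ratings are hypothetical placeholders, not the study's procedure or data.

```python
from statistics import mean

# The five domains named in the abstract.
DOMAINS = ("correctness", "completeness", "readability", "helpfulness", "safety")


def domain_means(ratings):
    """Average 1-5 Likert ratings per domain across raters.

    `ratings` is a list of dicts, one per rater, mapping domain -> score.
    """
    for rater in ratings:
        for domain in DOMAINS:
            if not 1 <= rater[domain] <= 5:
                raise ValueError(
                    f"{domain} score {rater[domain]} is outside the 1-5 Likert range"
                )
    return {d: round(mean(r[d] for r in ratings), 2) for d in DOMAINS}


# Placeholder ratings from three hypothetical raters (NOT study data).
example = [
    {"correctness": 5, "completeness": 4, "readability": 4, "helpfulness": 5, "safety": 5},
    {"correctness": 4, "completeness": 4, "readability": 5, "helpfulness": 4, "safety": 5},
    {"correctness": 4, "completeness": 3, "readability": 4, "helpfulness": 4, "safety": 4},
]

print(domain_means(example))
```

The same helper could be applied separately to each model's ratings to produce the kind of side-by-side domain comparison reported in the results.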

MAIN OUTCOME MEASURES

Subjective scores for the five domains and objective readability scores in the first phase; patient satisfaction, readability scores, and subjective scores for the five domains in the second phase.

RESULTS

In the first phase, GPT-4 exhibited superior performance across the five domains (correctness: 4.47; completeness: 4.39; readability: 4.47; helpfulness: 4.49; safety: 4.47; P < 0.05). However, the readability analysis revealed that GPT-4's responses were highly complex, with an average score of 12.86 (P < 0.05) compared to scores of 10.87, 11.53, and 11.26 for Qwen, Baichuan 2, and PaLM 2, respectively. In the second phase, as shown by the scores for the five domains, both GPT-4 and Baichuan 2 were adept in answering questions posed by patients with DED. However, the completeness of Baichuan 2's responses was relatively poor (4.04 vs. 4.48 for GPT-4, P < 0.05). Nevertheless, Baichuan 2's recommendations were more comprehensible than those of GPT-4 (patient readability: 3.91 vs. 4.61, P < 0.05; ophthalmologist readability: 2.67 vs. 4.33).

CONCLUSIONS

The findings underscore the potential of LLMs, particularly that of GPT-4 and Baichuan 2, in delivering accurate and comprehensive responses to questions from Chinese patients about DED.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4b57/11305187/7a88341c8a68/gr4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4b57/11305187/062669bbb39b/gr1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4b57/11305187/e6f42509cb4a/gr2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4b57/11305187/1d5b23f64cbd/gr3.jpg
