Evaluating Large Language Models in Ptosis-Related inquiries: A Cross-Lingual Study.

Authors

Niu Ling-Han, Wei Li, Qin Bixuan, Chen Tao, Dong Li, He Yueqing, Jiang Xue, Wang Mingyang, Ma Lan, Geng Jialu, Wang Lechen, Li Dongmei

Affiliations

Beijing Tongren Eye Center, and Beijing Ophthalmology Visual Science Key Lab, Beijing Tongren Hospital, Capital Medical University, Beijing, People's Republic of China.

Mingsii Co., Ltd, Beijing, People's Republic of China.

Publication information

Transl Vis Sci Technol. 2025 Jul 1;14(7):9. doi: 10.1167/tvst.14.7.9.

DOI:10.1167/tvst.14.7.9

PMID:40668049

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12279073/

Abstract

PURPOSE

The purpose of this study was to evaluate the performance of four large language models (LLMs), GPT-4, GPT-4o, Qwen2, and Qwen2.5, in answering patient- and clinician-focused questions about ptosis, with emphasis on cross-lingual applicability and patient-centric assessment.

METHODS

We collected 11 patient-centric and 50 doctor-centric questions covering ptosis symptoms, treatment, and postoperative care. Responses generated by GPT-4, GPT-4o, Qwen2, and Qwen2.5 were evaluated using predefined criteria: accuracy, sufficiency, clarity, and depth (doctor questions); and helpfulness, clarity, and empathy (patient questions). Clinical assessments involved 30 patients with ptosis and 8 oculoplastic surgeons rating responses on a 5-point Likert scale.

RESULTS

For doctor questions, GPT-4o outperformed Qwen2.5 in overall performance (53.1% vs. 18.8%, P = 0.035) and completeness (P = 0.049). For patient questions, GPT-4o scored higher in helpfulness (mean rank = 175.28 vs. 155.72, P = 0.035), with no significant differences in clarity or empathy. Qwen2.5 exhibited superior Chinese-language clarity compared to English (P = 0.023).

CONCLUSIONS

LLMs, particularly GPT-4o, demonstrate robust performance in ptosis-related inquiries, excelling in English and offering clinically valuable insights. Qwen2.5 showed advantages in Chinese clarity. Although promising for patient education and clinician support, these models require rigorous validation, domain-specific training, and cultural adaptation before clinical deployment. Future efforts should focus on refining multilingual capabilities and integrating real-time expert oversight to ensure safety and relevance in diverse healthcare contexts.

TRANSLATIONAL RELEVANCE

This study bridges artificial intelligence (AI) advancements with clinical practice by demonstrating how optimized LLMs can enhance patient education and cross-linguistic clinician support tools in ptosis-related inquiries.
