Niu Ling-Han, Wei Li, Qin Bixuan, Chen Tao, Dong Li, He Yueqing, Jiang Xue, Wang Mingyang, Ma Lan, Geng Jialu, Wang Lechen, Li Dongmei
Beijing Tongren Eye Center, and Beijing Ophthalmology Visual Science Key Lab, Beijing Tongren Hospital, Capital Medical University, Beijing, People's Republic of China.
Mingsii Co., Ltd, Beijing, People's Republic of China.
Transl Vis Sci Technol. 2025 Jul 1;14(7):9. doi: 10.1167/tvst.14.7.9.
PURPOSE: The purpose of this study was to evaluate the performance of four large language models (LLMs), GPT-4, GPT-4o, Qwen2, and Qwen2.5, in answering patient- and clinician-focused questions about ptosis, emphasizing cross-lingual applicability and patient-centric assessment.
METHODS: We collected 11 patient-centric and 50 doctor-centric questions covering ptosis symptoms, treatment, and postoperative care. Responses generated by GPT-4, GPT-4o, Qwen2, and Qwen2.5 were evaluated using predefined criteria: accuracy, sufficiency, clarity, and depth for doctor questions; and helpfulness, clarity, and empathy for patient questions. Clinical assessments involved 30 patients with ptosis and 8 oculoplastic surgeons, who rated responses on a 5-point Likert scale.
RESULTS: For doctor questions, GPT-4o outperformed Qwen2.5 in overall performance (53.1% vs. 18.8%, P = 0.035) and completeness (P = 0.049). For patient questions, GPT-4o scored higher in helpfulness (mean rank = 175.28 vs. 155.72, P = 0.035), with no significant differences in clarity or empathy. Qwen2.5 exhibited superior clarity in Chinese compared to English (P = 0.023).
CONCLUSIONS: LLMs, particularly GPT-4o, demonstrate robust performance on ptosis-related inquiries, excelling in English and offering clinically valuable insights. Qwen2.5 showed advantages in Chinese clarity. Although promising for patient education and clinician support, these models require rigorous validation, domain-specific training, and cultural adaptation before clinical deployment. Future efforts should focus on refining multilingual capabilities and integrating real-time expert oversight to ensure safety and relevance in diverse healthcare contexts.
TRANSLATIONAL RELEVANCE: This study bridges artificial intelligence (AI) advancements with clinical practice by demonstrating how optimized LLMs can enhance patient education and cross-linguistic clinician support tools for ptosis-related inquiries.
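The helpfulness result above is reported as a pair of mean ranks. The abstract does not name the statistical test; mean ranks are typical of rank-based comparisons such as the Mann-Whitney U test, so the following is only a minimal sketch, under that assumption, of how mean ranks can be computed from pooled Likert ratings. The function name `mean_ranks`, the labels `model_a` and `model_b`, and the ratings are all hypothetical and are not data from the study.

```python
def mean_ranks(groups):
    """Return the mean mid-rank of each group after pooling all ratings."""
    # Pool every rating together with its group label and sort by score.
    pooled = sorted(
        (score, name) for name, scores in groups.items() for score in scores
    )
    # Assign mid-ranks: tied scores share the average of the 1-based
    # positions they occupy in the sorted pool.
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        rank_of[pooled[i][0]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    # Average the ranks within each group.
    return {
        name: sum(rank_of[s] for s in scores) / len(scores)
        for name, scores in groups.items()
    }


# Hypothetical 5-point Likert ratings (illustrative only, not study data).
ratings = {
    "model_a": [5, 4, 4, 5, 3],
    "model_b": [3, 4, 2, 3, 4],
}
print(mean_ranks(ratings))
```

A higher mean rank indicates that a group's ratings tend to sit higher in the pooled ordering. A P value would come from a separate significance test on these ranks, which this sketch does not perform.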