Huang Yangyi, Shi Runhan, Chen Can, Zhou Xueyi, Zhou Xingtao, Hong Jiaxu, Chen Zhi
Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China; NHC Key Laboratory of Myopia (Fudan University), Key Laboratory of Myopia, Chinese Academy of Medical Sciences, Shanghai 200031, China; Shanghai Research Center of Ophthalmology and Optometry, China; Shanghai Engineering Research Center of Laser and Autostereoscopic 3D for Vision Care, China.
Eye Institute and Department of Ophthalmology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China.
Cont Lens Anterior Eye. 2025 Jun;48(3):102384. doi: 10.1016/j.clae.2025.102384. Epub 2025 Feb 11.
Large language models (LLMs) are increasingly used to address ophthalmic problems. However, their efficacy in patient education on orthokeratology, one of the main myopia control strategies, remains undetermined.
This cross-sectional study established a question bank of 24 orthokeratology-related questions, which were posed in Chinese to GPT-4, Qwen-72B, and Yi-34B. Objective evaluations were conducted using an online platform. Subjective evaluations, covering correctness, relevance, readability, applicability, safety, clarity, helpfulness, and satisfaction, were performed by experienced ophthalmologists and parents of myopic children using a 5-point Likert scale. Overall standardized scores were also calculated.
Responses from Qwen-72B had the lowest word count (199.42 ± 76.82; P < 0.001), with no significant differences among the LLMs in recommended age. GPT-4 scored lower in readability (3.79 ± 1.03) than Yi-34B (4.65 ± 0.51) and Qwen-72B (4.65 ± 0.61) (P < 0.001). No significant differences in safety, relevance, correctness, or applicability were observed across the three LLMs. Parents rated all three LLMs above 4.7 points on average, with GPT-4 outperforming the others in helpfulness (P = 0.004) and satisfaction (P = 0.016). Qwen-72B's overall standardized score surpassed those of the other two LLMs (P = 0.048).
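The abstract does not specify how the "overall standardized scores" were computed. A minimal sketch, assuming per-dimension z-score standardization of mean Likert ratings averaged within each model (a common aggregation choice, not confirmed by the source; the rating values below are illustrative, not the study's data):

```python
# Hypothetical reconstruction of an "overall standardized score":
# z-standardize each evaluation dimension across models, then average
# each model's z-scores. All numbers here are illustrative placeholders.
from statistics import mean, stdev

# model -> dimension -> mean 5-point Likert rating (assumed example values)
ratings = {
    "GPT-4":    {"readability": 3.79, "helpfulness": 4.9, "safety": 4.8},
    "Qwen-72B": {"readability": 4.65, "helpfulness": 4.7, "safety": 4.8},
    "Yi-34B":   {"readability": 4.65, "helpfulness": 4.7, "safety": 4.7},
}

def overall_standardized(ratings):
    dims = next(iter(ratings.values())).keys()
    z = {model: [] for model in ratings}
    for d in dims:
        col = [ratings[m][d] for m in ratings]       # one dimension, all models
        mu, sd = mean(col), stdev(col)
        for m in ratings:
            # z-score within the dimension; 0 if all models tied
            z[m].append((ratings[m][d] - mu) / sd if sd else 0.0)
    # average a model's z-scores across all dimensions
    return {m: mean(v) for m, v in z.items()}

scores = overall_standardized(ratings)
```

By construction the z-scores within each dimension sum to zero across models, so the overall scores also sum to approximately zero; a model's sign indicates whether it sits above or below the cross-model average.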
GPT-4 and the Chinese LLM Qwen-72B produced accurate and beneficial responses to inquiries on orthokeratology. Further refinement to improve accuracy is essential, particularly in diverse linguistic contexts.