Li Xuewei, Zhang Yixuan, Zheng Tonglei, Deng Yuanqi, Lu Yuchang, Hu Jie, Chen Sitong, Li Yan, Wang Kai
Department of Ophthalmology, Peking University People's Hospital, Eye Diseases and Optometry Institute, Beijing, China.
Institute of Medical Technology, Peking University Health Science Center, Beijing, China.
Digit Health. 2025 Jul 30;11:20552076251362338. doi: 10.1177/20552076251362338. eCollection 2025 Jan-Dec.
To evaluate the ability of large language models (LLMs) to produce patient education materials for myopic children and their parents.
Thirty-five common myopia-related questions were posed to four LLMs (ChatGPT 3.5, ChatGPT 4.0, ChatGPT 4o, and Gemini) with two distinct prompts, producing responses aimed at adults (Prompt A) and at children (Prompt B). Five ophthalmologists rated each response on a 5-point Likert scale for correctness, completeness, conciseness, and potential harm. Readability was assessed with Flesch-Kincaid scores. Kruskal-Wallis and Mann-Whitney tests were used to identify significant differences in LLM performance.
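As an illustration of this statistical workflow (not the authors' own code), the Python sketch below runs the same tests with scipy on simulated Likert ratings and computes a Flesch-Kincaid grade level; the ratings, the fk_grade helper, and its vowel-group syllable heuristic are all hypothetical stand-ins for the study's data and tooling.

    # Minimal, hypothetical sketch of the analysis described above;
    # the ratings below are simulated, not the study's data.
    import re
    import numpy as np
    from scipy.stats import kruskal, mannwhitneyu

    rng = np.random.default_rng(0)

    # Simulated 5-point Likert correctness ratings:
    # 5 ophthalmologists x 35 questions = 175 ratings per model.
    ratings = {
        "ChatGPT 3.5": rng.integers(2, 6, size=175),
        "ChatGPT 4.0": rng.integers(3, 6, size=175),
        "ChatGPT 4o":  rng.integers(3, 6, size=175),
        "Gemini":      rng.integers(2, 6, size=175),
    }

    # Kruskal-Wallis: omnibus test for any difference among the four models.
    h, p = kruskal(*ratings.values())
    print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.4g}")

    # Mann-Whitney U: pairwise follow-up, e.g. ChatGPT 4o vs. Gemini.
    u, p_pair = mannwhitneyu(ratings["ChatGPT 4o"], ratings["Gemini"])
    print(f"Mann-Whitney U = {u:.1f}, p = {p_pair:.4g}")

    def fk_grade(text: str) -> float:
        """Flesch-Kincaid grade level; syllables are approximated by vowel
        groups (the study may have used dedicated readability software)."""
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z]+", text)
        syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower())))
                        for w in words)
        return 0.39 * len(words) / sentences + 11.8 * syllables / len(words) - 15.59

    print(f"FK grade: {fk_grade('Myopia means the eye focuses light in front of the retina.'):.1f}")

A lower Flesch-Kincaid grade indicates more readable text, which is how the study compares the child-directed Prompt B responses against the adult-directed Prompt A responses.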
ChatGPT 4o achieved the highest proportion of positive ratings ("good" and above) for correctness (Prompt A: 91%; Prompt B: 83%) and conciseness (Prompt A: 79%; Prompt B: 63%), as well as the most favorable potential-harm ratings ("not at all" and "slightly": Prompt A: 99%; Prompt B: 97%) when generating educational materials for both adults and children (all P < 0.001). Completeness results varied between the two prompts. With Prompt A, ChatGPT 4.0 demonstrated the highest level of completeness (ChatGPT 4o: 69%, ChatGPT 4.0: 74%, ChatGPT 3.5: 51%, Gemini: 73%; P < 0.001), whereas with Prompt B, ChatGPT 4o achieved the highest score (ChatGPT 4o: 71%, ChatGPT 4.0: 65%, ChatGPT 3.5: 38%, Gemini: 46%; P < 0.001). Responses generated with Prompt B were significantly more readable than those generated with Prompt A across all LLMs (P ≤ 0.001).
Large language models, particularly ChatGPT 4o, hold potential for delivering effective patient education materials on myopia for both adult and pediatric populations. While generally effective, LLMs have limitations for complex medical queries, necessitating continued refinement for reliable clinical use.