Bülent Alyanak, Burak Tayyip Dede, Fatih Bağcıer, Mazlum Serdar Akaltun
Department of Physical Medicine and Rehabilitation, Gölcük Necati Çelik State Hospital, Kocaeli, Turkey.
Department of Physical Medicine and Rehabilitation, Prof. Dr. Cemil Tascioglu City Hospital, Istanbul, Turkey.
J Pediatr Gastroenterol Nutr. 2025 Jul;81(1):18-26. doi: 10.1002/jpn3.70069. Epub 2025 May 8.
This study evaluates the effectiveness of three widely used large language models (LLMs), ChatGPT-4, Copilot, and Gemini, in providing accurate, reliable, and understandable answers to frequently asked questions about pediatric dysphagia.
Twenty-five questions, selected on the basis of Google Trends data, were presented to ChatGPT-4, Copilot, and Gemini. Responses were rated for accuracy on a 5-point Likert scale, for information quality and reliability with the Ensuring Quality Information for Patients (EQIP) and DISCERN instruments, and for readability with the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease (FRE) scores. To assess consistency over time, the same question set was presented to each model at three time points: August, September, and October 2024. Statistical analyses included analysis of variance, Kruskal-Wallis tests, and post hoc comparisons, with p values below 0.05 considered significant.
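The abstract does not restate the readability indices it relies on, so for reference these are the standard Flesch formulas: a higher FRE indicates easier text, while FKGL approximates a US school grade level (lower is more readable).

```latex
\mathrm{FRE}  = 206.835 - 1.015\,\frac{\#\text{words}}{\#\text{sentences}} - 84.6\,\frac{\#\text{syllables}}{\#\text{words}}
\qquad
\mathrm{FKGL} = 0.39\,\frac{\#\text{words}}{\#\text{sentences}} + 11.8\,\frac{\#\text{syllables}}{\#\text{words}} - 15.59
```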
ChatGPT-4 achieved the highest mean accuracy score (4.1 ± 0.7), compared with Copilot (3.1 ± 0.7) and Gemini (3.8 ± 0.8), and the differences in quality ratings were significant (p < 0.001 and p < 0.05, respectively). EQIP and DISCERN scores further confirmed ChatGPT-4's superior performance. Gemini produced the most readable responses (FRE = 48.7 ± 9.9, FKGL = 10.1 ± 1.6).
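A minimal sketch of the kind of three-group comparison reported above, using scipy's f_oneway (ANOVA) and kruskal (Kruskal-Wallis), the two tests named in the methods. The rating arrays are hypothetical stand-ins for the per-question Likert accuracy scores; the study's actual data are not reproduced here.

```python
# Illustrative three-group comparison of per-question accuracy ratings.
# The arrays below are HYPOTHETICAL 5-point Likert scores, not study data.
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(0)
chatgpt4 = rng.integers(3, 6, size=25)  # 25 questions, ratings in 3..5
copilot  = rng.integers(2, 5, size=25)  # ratings in 2..4
gemini   = rng.integers(3, 6, size=25)  # ratings in 3..5

# Parametric one-way ANOVA across the three models
f_stat, p_anova = f_oneway(chatgpt4, copilot, gemini)

# Non-parametric Kruskal-Wallis test, suited to ordinal Likert data
h_stat, p_kw = kruskal(chatgpt4, copilot, gemini)

print(f"ANOVA:          F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```

If either omnibus test falls below 0.05, pairwise post hoc comparisons (e.g., with a multiplicity correction) would identify which model pairs differ, as described in the methods.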
While ChatGPT-4 generally provided more accurate and reliable information, Gemini produced more readable content. However, variability in overall information quality indicates that, although LLMs hold potential as tools for pediatric dysphagia education, further improvements are necessary to ensure consistent delivery of reliable and accessible information.