Dede Burak Tayyip, Oğuz Muhammed, Alyanak Bülent, Bağcıer Fatih, Yıldızgören Mustafa Turgut
Department of Physical Medicine and Rehabilitation, Prof Dr Cemil Taşcıoğlu City Hospital, Istanbul, Turkey.
Department of Physical Medicine and Rehabilitation, Istanbul Training and Research Hospital, Istanbul, Turkey.
HSS J. 2025 May 20:15563316251340697. doi: 10.1177/15563316251340697.
Background: The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs).
Purpose: We sought to characterize LLM responses to questions about piriformis syndrome (PS).
Methods: On August 15, 2024, we asked 3 LLMs (ChatGPT-4, Copilot, and Gemini) to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We rated the accuracy and completeness of the responses on Likert scales, assessed response quality with the Ensuring Quality Information for Patients (EQIP) tool, and assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores.
Results: The mean completeness scores of the responses from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively, a significant difference; in pairwise comparisons, ChatGPT and Gemini were superior to Copilot. Mean accuracy scores did not differ significantly between the LLMs. In the readability analyses, FKRE scores did not differ significantly, but FKGL scores did. The quality analysis based on EQIP scores also identified a significant difference between the LLMs.
Conclusion: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need improvement in accuracy, completeness, quality, and readability before they can better inform a general audience about PS.
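For context, both readability indices are computed from average sentence length and average syllables per word; the standard formulations (not reproduced in the abstract itself) are:

\[ \mathrm{FKRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right) \]

\[ \mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59 \]

Higher FKRE indicates easier text (on a 0-100 scale), while FKGL maps the same inputs to a US school grade level; patient-education materials are often targeted at roughly a sixth-grade reading level, which is the benchmark such studies typically use when judging readability for a general audience.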