Dede Burak Tayyip, Oğuz Muhammed, Alyanak Bülent, Bağcıer Fatih, Yıldızgören Mustafa Turgut
Department of Physical Medicine and Rehabilitation, Prof Dr Cemil Taşcıoğlu City Hospital, Istanbul, Turkey.
Department of Physical Medicine and Rehabilitation, Istanbul Training and Research Hospital, Istanbul, Turkey.
HSS J. 2025 May 20:15563316251340697. doi: 10.1177/15563316251340697.
Background: The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs).
Purpose: We sought to characterize LLM responses to questions about piriformis syndrome (PS).
Methods: On August 15, 2024, we asked 3 LLMs (ChatGPT-4, Copilot, and Gemini) to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We rated the accuracy and completeness of the responses on Likert scales, assessed response quality with the Ensuring Quality Information for Patients (EQIP) tool, and assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores.
Results: The mean completeness scores of the responses from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively, a significant difference; in pairwise comparisons, ChatGPT and Gemini were superior to Copilot. Mean accuracy scores did not differ significantly between the LLMs. In the readability analyses, FKRE scores did not differ significantly, but FKGL scores did. The quality analysis based on EQIP scores also identified a significant difference between the LLMs.
Conclusion: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need improvement in accuracy, completeness, quality, and readability before they can better inform a general audience about PS.
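For context, both readability indices are computed from average sentence length and average syllables per word; the standard formulations (not reproduced in the abstract itself) are:

\[ \mathrm{FKRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right) \]

\[ \mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59 \]

Higher FKRE indicates easier text (on a 0-100 scale), while FKGL maps the same inputs to a US school grade level; patient-education materials are often targeted at roughly a sixth-grade reading level, which is the benchmark such studies typically use when judging readability for a general audience.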