Artificial intelligence-generated responses to frequently asked questions on coccydynia: Evaluating the accuracy and consistency of GPT-4o's performance.

Authors

Keles Aslinur, Illeez Ozge Gulsum, Erbagci Berkay, Giray Esra

Affiliations

Department of Physical Medicine and Rehabilitation, Health Science University, Fatih Sultan Mehmet Training and Research Hospital, İstanbul, Türkiye.

Publication information

Arch Rheumatol. 2025 Mar 17;40(1):63-71. doi: 10.46497/ArchRheumatol.2025.10966. eCollection 2025 Mar.

DOI:10.46497/ArchRheumatol.2025.10966

PMID:40264482

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12010271/

Abstract

OBJECTIVES

This study aimed to assess whether GPT-4o's responses to patient-centered frequently asked questions about coccydynia are accurate and consistent when asked at different times and from different accounts.

MATERIALS AND METHODS

Questions were collected from medical websites, forums, and patient support groups and posed to GPT-4o. Two physiatrists evaluated the responses for accuracy and consistency. Each response was assigned to one of four categories: correct and comprehensive, correct but incomplete, partially correct and partially incorrect, or completely incorrect. Scoring disagreements were resolved by an additional reviewer as needed. Statistical analysis included Cohen's kappa for interreviewer reliability.
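For readers unfamiliar with Cohen's kappa, the following is a minimal sketch of how agreement between two reviewers could be computed. It is not the authors' analysis code: the function name `cohens_kappa` and the two rating lists are hypothetical and serve only as an illustration, with ratings coded 1 to 4 for the four response categories.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who scored the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same category by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal category frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical category; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for ten responses (illustrative only, not study data).
reviewer_1 = [1, 1, 2, 2, 1, 2, 3, 1, 2, 1]
reviewer_2 = [1, 2, 2, 2, 1, 2, 3, 1, 1, 1]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which is why it is usually preferred over simple percent agreement when a few categories dominate the ratings.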

RESULTS

Of the 81 responses, 45.7% were rated as correct and comprehensive, while 49.4% were correct but incomplete. Only 4.9% of the responses contained partially incorrect information, and no responses were completely incorrect. The interreviewer agreement was substantial (kappa=0.67), but 75% of the responses differed between the two rounds. Notably, 34.9% of initially incomplete answers improved in the second round.
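As a quick arithmetic check, the reported percentages can be converted back into response counts. The sketch below assumes all percentages are taken over the 81 responses; the counts it produces are inferred from the rounded percentages and are not stated in the abstract.

```python
TOTAL_RESPONSES = 81

# Percentages as reported in the abstract.
REPORTED_PERCENT = {
    "correct and comprehensive": 45.7,
    "correct but incomplete": 49.4,
    "partially correct and partially incorrect": 4.9,
    "completely incorrect": 0.0,
}


def implied_counts(total, percents):
    """Nearest whole-number count for each reported percentage (an inference)."""
    return {label: round(total * pct / 100) for label, pct in percents.items()}


counts = implied_counts(TOTAL_RESPONSES, REPORTED_PERCENT)
for label, n in counts.items():
    # Recompute the percentage from the inferred count for comparison.
    print(f"{label}: {n} responses ({100 * n / TOTAL_RESPONSES:.1f}%)")
print("sum of implied counts:", sum(counts.values()))
```

If the inferred counts sum to the total and reproduce the reported percentages when rounded to one decimal place, the figures in the abstract are internally consistent.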

CONCLUSION

GPT-4o shows promise in providing accurate and generally reliable information about coccydynia. However, the variability observed in response consistency across repeated queries suggests that while the model is useful for patient education and general inquiries, it may not be suitable for providing specialized clinical knowledge without human oversight.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d36c/12010271/008302d2ef69/AR-2025-40-1-063-071-F1.jpg

Cited by

Multimodal reasoning agent for enhanced ophthalmic decision-making: a preliminary real-world clinical validation.

Front Cell Dev Biol. 2025 Jul 23;13:1642539. doi: 10.3389/fcell.2025.1642539. eCollection 2025.
