Durgut Osman, Dikici Oğuzhan
Department of Otorhinolaryngology, Health Science University, Bursa City Hospital, T.C. Sağlık Bakanlığı Bursa Şehir Hastanesi Doğanköy Mahallesi, Nilüfer/Bursa, 16110, Turkey.
Eur Arch Otorhinolaryngol. 2025 Aug 23. doi: 10.1007/s00405-025-09630-3.
This study aims to evaluate the accuracy of ChatGPT-4.0 in providing information on tympanostomy tube indications in children, comparing its responses with established clinical guidelines and examining its ability to update itself over time.
Sixteen clinical scenarios from the American Academy of Otolaryngology-Head and Neck Surgery Foundation (AAO-HNSF) guidelines were assessed using 18 specific questions. Responses were evaluated by two otolaryngologists and by ChatGPT itself, with final validation by a senior otolaryngologist. Cohen's kappa analysis was performed to assess inter-rater reliability.
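For illustration only (not part of the study's methods): Cohen's kappa quantifies agreement between two raters beyond chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. A minimal Python sketch using scikit-learn's cohen_kappa_score, with hypothetical correct/incorrect ratings for the 16 scenarios:

    # Minimal sketch with hypothetical ratings; not the study's data.
    from sklearn.metrics import cohen_kappa_score

    # 1 = response judged correct, 0 = judged incorrect (hypothetical)
    rater_a = [1] * 6 + [0] + [1] * 9   # scenario 7 marked incorrect
    rater_b = [1] * 6 + [0] + [1] * 9   # second rater agrees on every scenario

    kappa = cohen_kappa_score(rater_a, rater_b)
    print(f"Cohen's kappa: {kappa:.2f}")  # 1.00 indicates perfect agreement

With identical rating vectors, observed agreement is 1.0 and kappa evaluates to 1.00, the "perfect agreement" result reported below.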
ChatGPT-4.0 answered 15.5 of the 16 scenarios correctly (96.8%); the second-stage question of scenario 7 was judged incorrect. When the model was directed to current literature, all responses reached 100% accuracy. Among the correct answers, four scenarios were not fully aligned with the guidelines; however, when the responses were based on current literature, all of these answers were fully compliant. Agreement among the three evaluators was perfect, as confirmed by Cohen's kappa analysis. Despite the use of an updated version (ChatGPT-4.0) and the passage of more than a year, a scenario that ChatGPT-3.5 had previously answered incorrectly was answered in the same incorrect manner, suggesting that the model may have limited capacity for self-updating over time. These findings are consistent with previous research indicating that ChatGPT provides highly accurate responses regarding tympanostomy tube placement and largely aligns with existing guidelines.
ChatGPT-4.0 demonstrates high accuracy in providing guideline-based medical information, but its ability to update itself over time appears to be limited. However, when prompted to reference current literature, its accuracy improves significantly. These findings highlight the importance of structured prompting and critical evaluation of AI-generated medical guidance.