Hack Sholem, Alsleibi Shibli, Saleh Naseem, Alon Eran E, Rabinovics Naomi, Remer Eric
St George's University of London School of Medicine, program delivered by the University of Nicosia at The Chaim Sheba Medical Center, Ramat Gan, Israel.
Department of Otolaryngology, Sheba Medical Center, Ramat Gan, Israel.
Eur Arch Otorhinolaryngol. 2025 Apr 30. doi: 10.1007/s00405-025-09433-6.
To evaluate the reliability and accuracy of large language models (LLMs) in answering patients' frequently asked questions (FAQs) about adult neck masses.
Twenty-four questions from the American Academy of Otolaryngology-Head and Neck Surgery were presented to ChatGPT, Claude, and Gemini. Five independent otolaryngologists evaluated the responses on six criteria: accuracy, extensiveness, misleading information, resource quality, guideline citations, and overall reliability. Statistical analysis used Fisher's exact tests and Fleiss' kappa.
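As a minimal sketch only (not the authors' analysis code), the two statistical procedures named above can be reproduced in Python with scipy and statsmodels; all counts and ratings below are hypothetical placeholders, not study data.

```python
# Illustration of the two analyses named in the methods: Fisher's exact
# test for comparing binary criterion outcomes between models, and
# Fleiss' kappa for agreement among five raters. Placeholder data only.
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Fisher's exact test on a 2x2 table: accurate vs. inaccurate answer
# counts for two hypothetical models across the 24 questions.
table = np.array([[23, 1],    # model A: accurate, inaccurate
                  [15, 9]])   # model B: accurate, inaccurate
_, p_value = fisher_exact(table, alternative="two-sided")
print(f"Fisher's exact test: p = {p_value:.4f}")

# Fleiss' kappa: 24 responses (rows) rated by 5 reviewers (columns) on a
# binary scale (1 = reliable, 0 = not). Placeholder ratings constructed
# with high agreement, loosely mimicking the near-perfect kappa reported.
ratings = np.ones((24, 5), dtype=int)
ratings[20:, :] = 0           # four responses all reviewers marked 0
ratings[5, 4] = 0             # a single dissenting rating
counts, _ = aggregate_raters(ratings)   # subjects x categories counts
print(f"Fleiss' kappa: {fleiss_kappa(counts, method='fleiss'):.2f}")
```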
All models showed high reliability (91.7-100%). Paid GPT and Gemini achieved the highest accuracy (95.8%). Extensiveness varied significantly (p = 0.012), with Gemini scoring lowest (62.5%). Resource quality ranged from 58.3% (Claude) to 100% (paid GPT). Guideline citations were most frequent for the GPT models (50%) and least frequent for Gemini (16.7%). Misleading information was rare (0-16.7%). Inter-rater agreement among the five reviewers was near-perfect (κ = 0.95).
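For context on the reported κ, Fleiss' kappa measures agreement among a fixed number of raters beyond what chance would produce; by the common Landis and Koch convention, values above 0.81 indicate almost perfect agreement:

```latex
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
```

where \bar{P} is the mean observed pairwise agreement across the 24 rated responses and \bar{P}_e is the agreement expected by chance from the marginal rating proportions.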
Large language models demonstrate high reliability and accuracy for neck mass patient education, with paid versions showing marginally better performance. While promising as educational tools, their variable guideline adherence and occasional misleading information suggest they should complement rather than replace professional medical advice.