Boscolo-Rizzo Paolo, Marcuzzo Alberto Vito, Lazzarin Chiara, Giudici Fabiola, Polesel Jerry, Stellin Marco, Pettorelli Andrea, Spinato Giacomo, Ottaviano Giancarlo, Ferrari Marco, Borsetto Daniele, Zucchini Simone, Trabalzini Franco, Sia Egidio, Gardenal Nicoletta, Baruca Roberto, Fortunati Alfonso, Vaira Luigi Angelo, Tirelli Giancarlo
Department of Medical, Surgical and Health Sciences, Section of Otolaryngology, University of Trieste, Trieste, Italy.
Unit of Cancer Epidemiology, Centro di Riferimento Oncologico di Aviano (CRO) Istituto di Ricovero e Cura a Carattere Scientifico (IRCCS), Aviano, Italy.
Clin Otolaryngol. 2025 Mar;50(2):330-335. doi: 10.1111/coa.14261. Epub 2024 Dec 4.
Artificial intelligence (AI) systems are changing the way information is accessed and consumed globally. This study aims to evaluate the quality of the information provided by the AI chatbots ChatGPT4 and Claude2 concerning reconstructive surgery for head and neck cancer.
Thirty questions on reconstructive surgery for head and neck cancer were posed to both AIs, and 16 head and neck surgeons assessed the responses using the QAMAI questionnaire. A 5-point Likert scale was used to rate accuracy, clarity, relevance, completeness, sources, and usefulness. Questions were categorised into those suitable for patients (group 1) and those suitable for surgeons (group 2). AI responses were compared using Student's t-test and the McNemar test. Agreement among surgeon scores was measured with the intraclass correlation coefficient (ICC), and readability was assessed with the Flesch-Kincaid Grade Level (FKGL).
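A minimal sketch of how these three comparisons could be reproduced in Python, assuming hypothetical per-question score data; the paper does not publish its analysis code, so the data layout, column names, and the acceptability threshold used for the McNemar table below are illustrative only (scipy for the paired Student's t-test, statsmodels for McNemar, pingouin for the ICC):

```python
# Hypothetical reproduction sketch; all data below are simulated placeholders.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import ttest_rel
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)

# Hypothetical per-question mean QAMAI scores (30 questions, one row each).
scores = pd.DataFrame({
    "chatgpt4": rng.normal(4.0, 0.5, 30),
    "claude2": rng.normal(3.9, 0.5, 30),
})

# Paired Student's t-test: the same 30 questions were answered by both AIs.
t_stat, p_val = ttest_rel(scores["chatgpt4"], scores["claude2"])
print(f"paired t-test: t={t_stat:.2f}, p={p_val:.3f}")

# McNemar test on paired binary outcomes (here, an assumed "acceptable"
# cutoff of >= 4) arranged as a 2x2 contingency table.
gpt_ok = scores["chatgpt4"] >= 4
claude_ok = scores["claude2"] >= 4
table = pd.crosstab(gpt_ok, claude_ok).reindex(
    index=[True, False], columns=[True, False], fill_value=0
)
print(mcnemar(table.values, exact=True))

# ICC for agreement among the 16 raters: long-format frame with one row
# per (question, rater) pair, ratings on the 5-point Likert scale.
long = pd.DataFrame({
    "question": np.repeat(np.arange(30), 16),
    "rater": np.tile(np.arange(16), 30),
    "rating": rng.integers(1, 6, 30 * 16),
})
icc = pg.intraclass_corr(
    data=long, targets="question", raters="rater", ratings="rating"
)
print(icc[["Type", "ICC", "CI95%"]])
```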
ChatGPT4 and Claude2 had similar overall mean scores for accuracy, clarity, relevance, completeness and usefulness, while Claude2 outperformed ChatGPT4 on sources (110.0 vs. 92.1, p < 0.001). Within group 2, Claude2 showed significantly lower accuracy and completeness scores than ChatGPT4 (p = 0.003 and p = 0.002, respectively). Regarding readability, ChatGPT4 produced less complex text than Claude2 (mean FKGL 4.57 vs. 6.05, p < 0.001), with 93% of its responses written in easy to fairly easy English.
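For reference, the FKGL is a standard readability formula, FKGL = 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59, where a lower score indicates simpler text; a mean of 4.57 corresponds roughly to a US fifth-grade reading level, versus roughly sixth grade for Claude2's 6.05. A short sketch using the textstat Python package (the sample sentence is hypothetical, not drawn from the study):

```python
# FKGL via the textstat package; lower score = easier to read.
import textstat

sample = (
    "A free flap moves tissue, with its blood vessels, from one part of "
    "the body to rebuild the area removed during cancer surgery."
)
print(textstat.flesch_kincaid_grade(sample))
```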
Our findings indicate that neither chatbot exhibits decisive superiority in all aspects. Nonetheless, ChatGPT4 demonstrated greater accuracy and completeness for specific types of questions, and its simpler language may better serve patient inquiries. However, many evaluators disagreed with the information provided by the chatbots, underscoring that AI systems cannot substitute for the advice of medical professionals.