Edalati Shaun, Sharma Shiven, Guda Rahul, Vasan Vikram, Mohamed Shahed, Gidumal Sunder, Govindaraj Satish, Iloreta Alfred Marc
Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Am J Otolaryngol. 2025 Jan-Feb;46(1):104563. doi: 10.1016/j.amjoto.2024.104563. Epub 2025 Jan 29.
To compare responses from publicly accessible chatbots against the American Academy of Otolaryngology-Head and Neck Surgery Foundation (AAO-HNS) clinical practice guideline on adult sinusitis.
ChatGPT-3.5, ChatGPT-4.0, Bard, and Llama 2 are openly accessible chatbots based on large language models. The accuracy, over-conclusiveness, supplemental content, and incompleteness of chatbot responses were compared against the AAO-HNS adult sinusitis clinical practice guideline.
Thirty questions derived from 12 AAO-HNS guideline statements were posed to the four chatbots. Adherence to AAO-HNS guidelines varied: Llama 2 provided 80 % accurate responses, Bard 83.3 %, ChatGPT-4.0 80 %, and ChatGPT-3.5 73.3 %. Over-conclusive responses were rare, with a single instance each from Llama 2 and ChatGPT-4.0. Rates of incomplete responses were higher: 40 % for Llama 2, 36.7 % for ChatGPT-3.5, 33.3 % for ChatGPT-4.0, and 23.3 % for Bard. Fisher's exact test revealed significant deviations from the guideline standard across all chatbots, with lower accuracy (p = 0.012 for Llama 2, p = 0.026 for Bard, p = 0.012 for ChatGPT-4.0, p = 0.002 for ChatGPT-3.5), inclusion of supplemental data (p < 0.001 for all), and less completeness (p < 0.01 for all), indicating potential areas for enhancement in their performance.
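As an illustration of the analysis described above, the comparison can be reconstructed as a 2×2 Fisher's exact test of a chatbot's accurate-response count against a perfectly adherent guideline standard. The sketch below assumes the reported 73.3 % accuracy for ChatGPT-3.5 corresponds to 22 of 30 accurate responses compared against a 30/30 standard, and that a one-tailed test was used; these are assumptions for illustration, not details confirmed by the abstract.

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided (left-tail) Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Row 1 = chatbot (accurate, inaccurate); row 2 = guideline standard.
    Uses the hypergeometric distribution with all margins fixed.
    """
    n = a + b + c + d
    row1 = a + b          # chatbot's total responses
    col1 = a + c          # total "accurate" count across both rows

    def pmf(k):
        # P(X = k) for a hypergeometric draw of row1 items from n,
        # where col1 of the n items are "accurate"
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    k_min = max(0, row1 - (n - col1))  # smallest feasible cell value
    return sum(pmf(k) for k in range(k_min, a + 1))

# Assumed table: ChatGPT-3.5 accurate on 22/30 items vs. a 30/30 guideline standard
p = fisher_exact_one_sided(22, 8, 30, 0)
print(round(p, 3))  # ~0.002, consistent with the reported p = 0.002
```

Under these assumptions the computed p-value matches the reported figure for ChatGPT-3.5; the same construction applied to the other chatbots' counts would yield their respective p-values.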
Although AI chatbots such as Llama 2, Bard, and ChatGPT show promise for sharing health-related information, their current performance in answering clinical questions about adult rhinosinusitis falls short of established clinical standards. Future iterations should address these shortcomings, with an emphasis on accuracy, completeness, and conformity with evidence-based practice.