Edalati Shaun, Sharma Shiven, Guda Rahul, Vasan Vikram, Mohamed Shahed, Gidumal Sunder, Govindaraj Satish, Iloreta Alfred Marc
Department of Otolaryngology-Head and Neck Surgery, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
Am J Otolaryngol. 2025 Jan-Feb;46(1):104563. doi: 10.1016/j.amjoto.2024.104563. Epub 2025 Jan 29.
To compare responses from publicly accessible chatbots against the American Academy of Otolaryngology-Head and Neck Surgery Foundation (AAO-HNS) clinical practice guideline on adult sinusitis.
ChatGPT-3.5, ChatGPT-4.0, Bard, and Llama 2 are openly accessible chatbots based on large language models. The accuracy, over-conclusiveness, supplemental content, and incompleteness of chatbot responses were compared against the AAO-HNS adult sinusitis clinical practice guideline.
Thirty questions derived from 12 AAO-HNS guideline statements were posed to the four chatbots. Adherence to AAO-HNS guidelines varied: Llama 2 provided 80 % accurate responses, Bard 83.3 %, ChatGPT-4.0 80 %, and ChatGPT-3.5 73.3 %. Over-conclusive responses were rare, with a single instance each from Llama 2 and ChatGPT-4.0. Rates of incomplete responses were higher: 40 % for Llama 2, 36.7 % for ChatGPT-3.5, 33.3 % for ChatGPT-4.0, and 23.3 % for Bard. Fisher's exact test revealed significant deviations from the guideline standard across all chatbots, with lower accuracy (p = 0.012 for Llama 2, p = 0.026 for Bard, p = 0.012 for ChatGPT-4.0, p = 0.002 for ChatGPT-3.5), inclusion of supplemental data (p < 0.001 for all), and less completeness (p < 0.01 for all), indicating potential areas for enhancement in their performance.
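As an illustration of the analysis described above, the comparison can be reconstructed as a 2×2 Fisher's exact test of a chatbot's accurate-response count against a perfectly adherent guideline standard. The sketch below assumes the reported 73.3 % accuracy for ChatGPT-3.5 corresponds to 22 of 30 accurate responses compared against a 30/30 standard, and that a one-tailed test was used; these are assumptions for illustration, not details confirmed by the abstract.

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    """One-sided (left-tail) Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Row 1 = chatbot (accurate, inaccurate); row 2 = guideline standard.
    Uses the hypergeometric distribution with all margins fixed.
    """
    n = a + b + c + d
    row1 = a + b          # chatbot's total responses
    col1 = a + c          # total "accurate" count across both rows

    def pmf(k):
        # P(X = k) for a hypergeometric draw of row1 items from n,
        # where col1 of the n items are "accurate"
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    k_min = max(0, row1 - (n - col1))  # smallest feasible cell value
    return sum(pmf(k) for k in range(k_min, a + 1))

# Assumed table: ChatGPT-3.5 accurate on 22/30 items vs. a 30/30 guideline standard
p = fisher_exact_one_sided(22, 8, 30, 0)
print(round(p, 3))  # ~0.002, consistent with the reported p = 0.002
```

Under these assumptions the computed p-value matches the reported figure for ChatGPT-3.5; the same construction applied to the other chatbots' counts would yield their respective p-values.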
Although AI chatbots such as Llama 2, Bard, and ChatGPT show promise for sharing health-related information, their current performance in answering clinical questions about adult rhinosinusitis falls short of established clinical standards. Future iterations should address these shortcomings, with an emphasis on accuracy, completeness, and conformity with evidence-based practice.