Evaluating Artificial Intelligence-Driven Responses to Acute Liver Failure Queries: A Comparative Analysis Across Accuracy, Clarity, and Relevance.

Author Information

Malik Sheza, Frey Lewis J, Gutman Jason, Mushtaq Asim, Warraich Fatima, Qureshi Kamran

Affiliations

Internal Medicine, Rochester General Hospital, Rochester, New York, USA.

Ralph H. Johnson Veterans Affairs Medical Center, Charleston, South Carolina, USA.

Publication Information

Am J Gastroenterol. 2024 Dec 17. doi: 10.14309/ajg.0000000000003255.

Abstract

INTRODUCTION

Recent advancements in artificial intelligence (AI), particularly the deployment of large language models (LLMs), have profoundly impacted healthcare. This study assesses 5 LLMs (ChatGPT 3.5, ChatGPT 4, BARD, CLAUDE, and COPILOT) on the accuracy, clarity, and relevance of their responses to queries concerning acute liver failure (ALF). We subsequently compare these results with those of ChatGPT 4 enhanced with retrieval-augmented generation (RAG) technology.

METHODS

Based on real-world clinical use and the American College of Gastroenterology guidelines, we formulated 16 ALF questions and clinical scenarios to probe the LLMs' ability to handle different types of clinical questions. Each query was processed individually in every model through its "New Chat" functionality, so that no conversation history could carry over and bias later responses. Additionally, we employed the RAG functionality of GPT-4, which grounds its output in external sources cited as references. Four independent investigators rated every response for accuracy, clarity, and relevance on a 1-to-5 Likert scale to ensure impartiality.
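
The abstract does not publish the query pipeline itself, so the following is only a minimal sketch of the protocol's API analogue, assuming the OpenAI Python SDK and illustrative placeholder questions (the study's 16 actual items are not listed here): each question is sent as a single-turn request with no shared conversation history, mirroring what the "New Chat" button does in the web interface.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative stand-ins only; the study's 16 actual questions/scenarios
# are not reproduced in the abstract.
ALF_QUESTIONS = [
    "What are the diagnostic criteria for acute liver failure?",
    "When is liver transplantation indicated in acute liver failure?",
]

def ask_fresh(question: str, model: str = "gpt-4") -> str:
    """Send one question with no prior conversation history, mimicking a
    brand-new chat session so earlier answers cannot bias later ones."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],  # single turn only
    )
    return response.choices[0].message.content

answers = {q: ask_fresh(q) for q in ALF_QUESTIONS}
```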

RESULTS

ChatGPT 4 augmented with RAG demonstrated superior performance compared with the other models, consistently scoring the highest across all 3 domains (4.70 in accuracy, 4.89 in clarity, and 4.78 in relevance). ChatGPT 4 exhibited notable proficiency, with scores of 3.67 in accuracy, 4.04 in clarity, and 4.01 in relevance. In contrast, CLAUDE achieved 3.65 in accuracy, 3.04 in clarity, and 3.6 in relevance. BARD and COPILOT performed at lower levels: BARD scored 2.01 in accuracy and 3.03 in relevance, while COPILOT scored 2.26 in accuracy and 3.12 in relevance.
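
The abstract does not state the aggregation method explicitly, but the reported figures are presumably means of the 1-to-5 ratings pooled across the 4 raters and 16 questions. A minimal sketch with placeholder numbers (not the study's raw data):

```python
from statistics import mean

# Placeholder ratings (NOT the study's data): in the study, each list would
# hold 4 raters x 16 questions = 64 scores per model and domain.
ratings = {
    "ChatGPT 4 + RAG": {"accuracy": [5, 4, 5, 5], "clarity": [5, 5, 5, 4],
                        "relevance": [5, 5, 4, 5]},
    "BARD": {"accuracy": [2, 2, 2, 2], "clarity": [3, 3, 4, 3],
             "relevance": [3, 3, 3, 3]},
}

# Each reported figure is the mean over all raters and questions.
means = {model: {domain: round(mean(scores), 2)
                 for domain, scores in domains.items()}
         for model, domains in ratings.items()}
print(means)
```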

DISCUSSION

The study highlights the superior performance of ChatGPT 4 + RAG compared with the other LLMs. By integrating RAG with an LLM, the system combines generative language skills with accurate, up-to-date information, which improves the clarity, relevance, and accuracy of responses and makes them more effective for healthcare use. However, AI models must continually evolve and remain aligned with medical practice for successful healthcare integration.
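
The abstract does not detail how the RAG pipeline was built, so the sketch below shows only the generic retrieve-then-ground pattern it describes, with a toy keyword retriever and hypothetical guideline snippets standing in for a real embedding index over reference sources:

```python
from openai import OpenAI

client = OpenAI()

# Toy in-memory "guideline" corpus; hypothetical snippets, not actual
# guideline text used in the study.
GUIDELINE_PASSAGES = [
    "Acute liver failure is characterized by coagulopathy and hepatic "
    "encephalopathy in a patient without preexisting cirrhosis.",
    "N-acetylcysteine is recommended for acetaminophen-induced ALF.",
]

def retrieve_passages(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy keyword-overlap retriever; a production RAG system would use
    embeddings and a vector index instead."""
    q_words = set(question.lower().split())
    return sorted(corpus,
                  key=lambda p: len(q_words & set(p.lower().split())),
                  reverse=True)[:k]

def answer_with_rag(question: str) -> str:
    """Ground the model's answer in retrieved excerpts."""
    context = "\n\n".join(retrieve_passages(question, GUIDELINE_PASSAGES))
    prompt = ("Answer the question using only the excerpts below, citing "
              f"them where possible.\n\nExcerpts:\n{context}\n\n"
              f"Question: {question}")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```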
