Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study.

Author information

Aljamaan Fadi, Temsah Mohamad-Hani, Altamimi Ibraheem, Al-Eyadhy Ayman, Jamal Amr, Alhasan Khalid, Mesallam Tamer A, Farahat Mohamed, Malki Khalid H

Affiliations

College of Medicine, King Saud University, Riyadh, Saudi Arabia.

Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia.

Publication information

JMIR Med Inform. 2024 Jul 31;12:e54345. doi: 10.2196/54345.

Abstract

BACKGROUND

Artificial intelligence (AI) chatbots have recently come into use among health care practitioners in medical practice. Notably, the output of these AI chatbots has been found to contain varying degrees of hallucination in both content and references. Such hallucinations cast doubt on their output and on their implementation.

OBJECTIVE

The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations.

METHODS

Six AI chatbots were challenged with the same 10 medical prompts, each requesting 10 references. The RHS comprises 6 bibliographic items plus the reference's relevance to the prompt's keywords. The RHS was calculated for each reference, each prompt, and each type of prompt (basic vs complex). The average RHS was then calculated for each AI chatbot and compared across prompt types and chatbots.
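To make the scoring scheme concrete, the sketch below tallies a per-reference RHS in Python, assuming each of the 6 bibliographic items and the keyword-relevance check contributes one binary hallucination flag. The item names (title, authors, journal, year, volume/pages, DOI) and the plain summation are illustrative assumptions: the abstract does not state the per-item scale or the aggregation that yields the reported chatbot-level scores (eg, 11 for ChatGPT 3.5).

from dataclasses import dataclass

# Assumed names for the 6 bibliographic items; the abstract only says
# "6 bibliographic items" without listing them.
BIBLIOGRAPHIC_ITEMS = ("title", "authors", "journal", "year", "volume_pages", "doi")

@dataclass
class ReferenceCheck:
    # Each flag is True if that item could not be verified (ie, hallucinated).
    title: bool
    authors: bool
    journal: bool
    year: bool
    volume_pages: bool
    doi: bool
    relevant_to_keywords: bool  # True if the reference matches the prompt's keywords

def reference_rhs(check: ReferenceCheck) -> int:
    """Count hallucinated items: the 6 bibliographic fields plus relevance."""
    score = sum(getattr(check, item) for item in BIBLIOGRAPHIC_ITEMS)
    score += int(not check.relevant_to_keywords)  # irrelevance counts as hallucination
    return score

def average_rhs(checks: list[ReferenceCheck]) -> float:
    """Average per-reference tally across all references for a prompt or chatbot."""
    return sum(reference_rhs(c) for c in checks) / len(checks)

# Example: a reference with fabricated authors and DOI that is also off topic.
ref = ReferenceCheck(title=False, authors=True, journal=False, year=False,
                     volume_pages=False, doi=True, relevant_to_keywords=False)
print(reference_rhs(ref))  # 3

Under these assumptions, a higher tally means more hallucinated items, which matches the direction of the reported scores (higher RHS indicates more hallucination).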

RESULTS

Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated an intermediate RHS (score=7). The highest degree of hallucination was observed for reference relevance to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (β coefficient=-0.069; P=.32), while Perplexity had a significantly lower RHS than ChatGPT (β coefficient=-0.345; P<.001). AI chatbots generally had a significantly higher RHS when prompted with scenario or complex-format prompts (β coefficient=0.486; P<.001).

CONCLUSIONS

The variation in RHS underscores the need for a robust reference evaluation tool to improve the authenticity of AI chatbots' citations, and it highlights the importance of verifying their output and citations. Elicit and SciSpace showed negligible hallucination, whereas ChatGPT and Bing showed critical levels of hallucination. The proposed RHS could contribute to ongoing efforts to enhance the general reliability of AI in medical research.


