Aljamaan Fadi, Temsah Mohamad-Hani, Altamimi Ibraheem, Al-Eyadhy Ayman, Jamal Amr, Alhasan Khalid, Mesallam Tamer A, Farahat Mohamed, Malki Khalid H
College of Medicine, King Saud University, Riyadh, Saudi Arabia.
Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia.
JMIR Med Inform. 2024 Jul 31;12:e54345. doi: 10.2196/54345.
Artificial intelligence (AI) chatbots have recently been adopted by health care practitioners in medical practice. Notably, their output has been found to contain varying degrees of hallucination in both content and references. Such hallucinations raise doubts about the trustworthiness of their output and about their implementation in practice.
The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations.
Six AI chatbots were challenged with the same 10 medical prompts, each requesting 10 references. The RHS comprises 6 bibliographic items plus the reference's relevance to the prompt's keywords. The RHS was calculated for each reference, each prompt, and each prompt type (basic vs complex). The average RHS was then calculated for each AI chatbot and compared across prompt types and chatbots.
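As an illustrative sketch only, one way such a per-reference score could be computed is shown below in Python, assuming each of the 6 bibliographic items and the keyword-relevance check is rated with a binary hallucination flag and weighted equally; the item names, flags, and weighting are assumptions for illustration, not the authors' exact rubric or scale.

    # Hedged sketch of a per-reference RHS: each assumed bibliographic item and the
    # keyword-relevance check receives a binary flag (1 = unverifiable/fabricated,
    # 0 = verified); the item set and equal weighting are illustrative assumptions.
    BIBLIOGRAPHIC_ITEMS = ["authors", "title", "journal", "year",
                           "volume_pages", "doi_or_link"]  # assumed item set

    def reference_rhs(flags):
        """Sum hallucination flags for one reference; 0 means fully verified."""
        items = BIBLIOGRAPHIC_ITEMS + ["keyword_relevance"]
        return sum(int(flags.get(item, 1)) for item in items)

    def mean_rhs(references):
        """Average RHS over all references returned by one chatbot or prompt."""
        scores = [reference_rhs(ref) for ref in references]
        return sum(scores) / len(scores) if scores else float("nan")

    # A reference with a fabricated DOI and content unrelated to the prompt keywords:
    example = {"authors": 0, "title": 0, "journal": 0, "year": 0,
               "volume_pages": 0, "doi_or_link": 1, "keyword_relevance": 1}
    print(reference_rhs(example))  # 2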
Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated an intermediate RHS (score=7). The highest degree of hallucination was observed for reference relevance to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (β coefficient=-0.069; P=.32), while Perplexity had a significantly lower RHS than ChatGPT (β coefficient=-0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenario-based or complex-format prompts (β coefficient=0.486; P<.001).
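The between-chatbot and between-prompt-type comparisons above are of the kind obtainable from a regression of RHS on chatbot and prompt type. The sketch below uses ordinary least squares with ChatGPT as the reference category; the data values, column names, and model choice are hypothetical illustrations, not the authors' analysis.

    # Hedged sketch of a regression comparing RHS across chatbots and prompt types.
    # The data frame, column names, and use of ordinary least squares are assumptions.
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.DataFrame({
        "rhs":     [11, 11, 7, 1, 1, 10, 9, 6, 2, 1],   # hypothetical per-prompt scores
        "chatbot": ["ChatGPT", "Bing", "Perplexity", "Elicit", "SciSpace"] * 2,
        "prompt":  ["basic"] * 5 + ["complex"] * 5,
    })

    model = smf.ols("rhs ~ C(chatbot, Treatment('ChatGPT')) + C(prompt)", data=df).fit()
    print(model.params)    # beta coefficients relative to ChatGPT and basic prompts
    print(model.pvalues)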
The variation in RHS underscores the need for a robust tool to evaluate the authenticity of AI chatbots' citations and highlights the importance of verifying their output and references. Elicit and SciSpace showed negligible hallucination, whereas ChatGPT and Bing showed critical hallucination levels. The proposed RHS could contribute to ongoing efforts to enhance the general reliability of AI in medical research.