Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study.

Authors

Aljamaan Fadi, Temsah Mohamad-Hani, Altamimi Ibraheem, Al-Eyadhy Ayman, Jamal Amr, Alhasan Khalid, Mesallam Tamer A, Farahat Mohamed, Malki Khalid H

Affiliations

College of Medicine, King Saud University, Riyadh, Saudi Arabia.

Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, Riyadh, Saudi Arabia.

Publication Information

JMIR Med Inform. 2024 Jul 31;12:e54345. doi: 10.2196/54345.


DOI: 10.2196/54345
PMID: 39083799
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11325115/
Abstract

BACKGROUND: Artificial intelligence (AI) chatbots have recently gained use in medical practice by health care practitioners. Interestingly, the output of these AI chatbots was found to have varying degrees of hallucination in content and references. Such hallucinations generate doubts about their output and their implementation.

OBJECTIVE: The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots' citations.

METHODS: Six AI chatbots were challenged with the same 10 medical prompts, requesting 10 references per prompt. The RHS is composed of 6 bibliographic items and the reference's relevance to prompts' keywords. RHS was calculated for each reference, prompt, and type of prompt (basic vs complex). The average RHS was calculated for each AI chatbot and compared across the different types of prompts and AI chatbots.

RESULTS: Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), while Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated a middle RHS (score=7). The highest degree of hallucination was observed for reference relevancy to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). ChatGPT and Bing had comparable RHS (β coefficient=-0.069; P=.32), while Perplexity had significantly lower RHS than ChatGPT (β coefficient=-0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenarios or complex format prompts (β coefficient=0.486; P<.001).

CONCLUSIONS: The variation in RHS underscores the necessity for a robust reference evaluation tool to improve the authenticity of AI chatbots. Further, the variations highlight the importance of verifying their output and citations. Elicit and SciSpace had negligible hallucination, while ChatGPT and Bing had critical hallucination levels. The proposed AI chatbots' RHS could contribute to ongoing efforts to enhance AI's general reliability in medical research.
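The abstract describes the RHS as a composite of 6 bibliographic items plus a keyword-relevance check, with per-reference scores reported up to 11. The sketch below illustrates how such a score might be computed; the specific field names, the binary per-field checks, the relevance weight of 5 (chosen only so the maximum of 6 + 5 matches the reported range), and the keyword-overlap heuristic are all assumptions for illustration, not the authors' exact rubric.

```python
from dataclasses import dataclass

# Assumed field list and weighting: the paper specifies 6 bibliographic
# items plus keyword relevance, but the exact items and weights here are
# illustrative guesses, not the published scoring rubric.
BIBLIO_FIELDS = ["title", "authors", "journal", "year", "doi", "pmid"]
RELEVANCE_WEIGHT = 5  # assumed heavier weight on keyword relevance


@dataclass
class Reference:
    title: str
    authors: str
    journal: str
    year: str
    doi: str
    pmid: str


def reference_hallucination_score(cited: Reference,
                                  ground_truth: Reference | None,
                                  prompt_keywords: set[str]) -> int:
    """Score one chatbot-cited reference: 0 = fully verifiable,
    higher = more hallucinated fields."""
    score = 0
    if ground_truth is None:
        # No matching real record found: count every bibliographic
        # item as hallucinated.
        score += len(BIBLIO_FIELDS)
    else:
        # One point per bibliographic field that disagrees with the
        # verified record.
        for field in BIBLIO_FIELDS:
            if getattr(cited, field).strip().lower() != \
                    getattr(ground_truth, field).strip().lower():
                score += 1
    # Relevance check: penalize references whose title shares no
    # keyword with the prompt.
    title_words = set(cited.title.lower().split())
    if not (prompt_keywords & title_words):
        score += RELEVANCE_WEIGHT
    return score
```

Averaging this per-reference score over the 10 references returned for each of the 10 prompts would yield the per-chatbot RHS values compared in the Results (e.g., 11 for ChatGPT 3.5 and Bing vs 1 for Elicit and SciSpace), under the weighting assumptions stated above.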


Similar Articles

[1]
Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study.

JMIR Med Inform. 2024-7-31

[2]
Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.

Cureus. 2024-8-28

[3]
Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted From Patient Brochures.

Cureus. 2024-3-23

[4]
Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care.

Medicine (Baltimore). 2024-8-16

[5]
Beyond the Hype-The Actual Role and Risks of AI in Today's Medical Practice: Comparative-Approach Study.

JMIR AI. 2024-1-22

[6]
Assessment of Artificial Intelligence Chatbot Responses to Top Searched Queries About Cancer.

JAMA Oncol. 2023-10-1

[7]
Comparative analysis of artificial intelligence chatbot recommendations for urolithiasis management: A study of EAU guideline compliance.

Fr J Urol. 2024-7

[8]
Evaluation and Comparison of Ophthalmic Scientific Abstracts and References by Current Artificial Intelligence Chatbots.

JAMA Ophthalmol. 2023-9-1

[9]
Exploring the Possible Use of AI Chatbots in Public Health Education: Feasibility Study.

JMIR Med Educ. 2023-11-1

[10]
Performance of ChatGPT-4 and Bard chatbots in responding to common patient questions on prostate cancer Lu-PSMA-617 therapy.

Front Oncol. 2024-7-12

Cited By

[1]
AI in conjunctivitis research: assessing ChatGPT and DeepSeek for etiology, intervention, and citation integrity via hallucination rate analysis.

Front Artif Intell. 2025-8-20

[2]
Advances in Periodontal Diagnostics: Application of MultiModal Language Models in Visual Interpretation of Panoramic Radiographs.

Diagnostics (Basel). 2025-7-23

[3]
Performance of AI Models vs. Orthopedic Residents in Turkish Specialty Training Development Exams in Orthopedics.

Sisli Etfal Hastan Tip Bul. 2025-2-7

[4]
Large Language Model Synergy for Ensemble Learning in Medical Question Answering: Design and Evaluation Study.

J Med Internet Res. 2025-7-14

[5]
The Emergence of Applied Artificial Intelligence in the Realm of Value Based Musculoskeletal Care.

Curr Rev Musculoskelet Med. 2025-6-14

[6]
Comparing orthodontic pre-treatment information provided by large language models.

BMC Oral Health. 2025-5-28

[7]
From prompt to platform: an agentic AI workflow for healthcare simulation scenario design.

Adv Simul (Lond). 2025-5-16

[8]
Building and Beta-Testing Be Well Buddy Chatbot, a Secure, Credible and Trustworthy AI Chatbot That Will Not Misinform, Hallucinate or Stigmatize Substance Use Disorder: Development and Usability Study.

JMIR Hum Factors. 2025-5-7

[9]
Public knowledge of food poisoning, risk perception and food safety practices in Saudi Arabia: A cross-sectional survey following foodborne botulism outbreak.

Medicine (Baltimore). 2025-4-11

[10]
Authors' Reply: Citation Accuracy Challenges Posed by Large Language Models.

JMIR Med Educ. 2025-4-2

References

[1]
Art or Artifact: Evaluating the Accuracy, Appeal, and Educational Value of AI-Generated Imagery in DALL·E 3 for Illustrating Congenital Heart Diseases.

J Med Syst. 2024-5-23

[2]
The new paradigm in machine learning - foundation models, large language models and beyond: a primer for physicians.

Intern Med J. 2024-5

[3]
Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis.

Surg Endosc. 2024-5

[4]
Exploring the Possible Use of AI Chatbots in Public Health Education: Feasibility Study.

JMIR Med Educ. 2023-11-1

[5]
A SWOT (Strengths, Weaknesses, Opportunities, and Threats) Analysis of ChatGPT in the Medical Literature: Concise Review.

J Med Internet Res. 2023-11-16

[6]
Opportunities, Challenges, and Future Directions of Generative Artificial Intelligence in Medical Education: Scoping Review.

JMIR Med Educ. 2023-10-20

[7]
ChatGPT Surpasses 1000 Publications on PubMed: Envisioning the Road Ahead.

Cureus. 2023-9-6

[8]
The use of artificial intelligence to improve the scientific writing of non-native english speakers.

Rev Assoc Med Bras (1992). 2023

[9]
The Potential and Concerns of Using AI in Scientific Research: ChatGPT Performance Evaluation.

JMIR Med Educ. 2023-9-14

[10]
Artificial Hallucinations by Google Bard: Think Before You Leap.

Cureus. 2023-8-10
