
Evaluation of the performance of large language models in clinical decision-making in endodontics.

Author information

Özbay Yağız, Erdoğan Deniz, Dinçer Gözde Akbal

Affiliations

Department of Endodontics, Faculty of Dentistry, Karabük University, Karabük, Türkiye.

Private Dentist, Ankara, Türkiye.

Publication information

BMC Oral Health. 2025 Apr 28;25(1):648. doi: 10.1186/s12903-025-06050-x.

Abstract

BACKGROUND

Artificial intelligence (AI) chatbots excel at generating language. The growing use of generative-AI large language models (LLMs) in healthcare and dentistry, including endodontics, raises questions about their accuracy, so the potential of LLMs to assist clinicians' decision-making in endodontics is worth evaluating. This study comparatively evaluates the answers provided by Google Bard, ChatGPT-3.5, and ChatGPT-4 to clinically relevant questions in endodontics.

METHODS

Forty open-ended questions covering different areas of endodontics were prepared and introduced to Google Bard, ChatGPT-3.5, and ChatGPT-4. The validity of the questions was evaluated using the Lawshe Content Validity Index. Two experienced endodontists, blinded to the chatbots, evaluated the answers on a 3-point Likert scale. All responses deemed to contain factually wrong information were noted, and a misinformation rate for each LLM was calculated (number of answers containing wrong information / total number of questions). One-way analysis of variance and the post hoc Tukey test were used to analyze the data, with significance set at p < 0.05.
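The quantities the methods describe can be sketched in a few lines. This is a minimal illustration with hypothetical ratings (not the study's data): Lawshe's content validity ratio for a question, the per-model misinformation rate, and a hand-rolled one-way ANOVA F statistic. The chatbot names are from the study; every number is invented for illustration.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe CVR = (n_e - N/2) / (N/2), where n_e panelists rate the item 'essential'."""
    return (n_essential - n_panelists / 2) / (n_panelists / 2)

def misinformation_rate(wrong_flags):
    """Share of answers flagged as factually wrong: wrong answers / total questions."""
    return sum(wrong_flags) / len(wrong_flags)

def one_way_anova_f(groups):
    """One-way ANOVA F = MS_between / MS_within over a list of score lists."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

if __name__ == "__main__":
    # Hypothetical 3-point Likert ratings for 8 questions per chatbot.
    scores = {
        "ChatGPT-4":   [3, 3, 2, 3, 3, 2, 3, 3],
        "ChatGPT-3.5": [2, 3, 2, 2, 3, 2, 2, 3],
        "Google Bard": [2, 2, 1, 2, 2, 1, 2, 2],
    }
    # 1 = answer contained factually wrong information, 0 = it did not.
    wrong = {
        "ChatGPT-4":   [0, 0, 0, 0, 0, 0, 0, 1],
        "ChatGPT-3.5": [0, 0, 1, 0, 0, 1, 0, 0],
        "Google Bard": [0, 1, 1, 0, 1, 1, 0, 1],
    }
    for name in scores:
        print(name, "misinformation rate:", round(misinformation_rate(wrong[name]), 3))
    print("ANOVA F:", round(one_way_anova_f(list(scores.values())), 2))
    print("CVR (9 of 10 raters say 'essential'):", content_validity_ratio(9, 10))
```

In practice the p-value and the pairwise post hoc comparisons would come from library routines such as `scipy.stats.f_oneway` and `scipy.stats.tukey_hsd`; the sketch above stays stdlib-only to show the arithmetic behind the reported metrics.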

RESULTS

ChatGPT-4 achieved the highest score and the lowest misinformation rate (P = 0.008), followed by ChatGPT-3.5 and Google Bard, respectively. The difference between ChatGPT-4 and Google Bard was statistically significant (P = 0.004).

CONCLUSION

ChatGPT-4 provided the most accurate and informative answers in endodontics. However, all LLMs produced varying levels of incomplete or incorrect answers.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/608b/12039063/55319a09713f/12903_2025_6050_Fig1_HTML.jpg
