


Comparative accuracy of artificial intelligence chatbots in pulpal and periradicular diagnosis: A cross-sectional study.

Affiliations

Postgraduate Program in Clinical Dentistry, University Center of Pará (CESUPA), Belém, Pará, Brazil.

Center for Health Sciences, Pontifical Catholic University of Campinas (PUC-Campinas), Postgraduate Program in Health Sciences, Campinas, São Paulo, Brazil.

Publication information

Comput Biol Med. 2024 Dec;183:109332. doi: 10.1016/j.compbiomed.2024.109332. Epub 2024 Oct 30.

DOI: 10.1016/j.compbiomed.2024.109332
PMID: 39471663
Abstract

OBJECTIVES

This study aimed to evaluate the diagnostic accuracy and treatment recommendation performance of four artificial intelligence chatbots in fictional pulpal and periradicular disease cases. Additionally, it investigated response consistency and the influence of text order and language on chatbot performance.

METHODS

In this cross-sectional comparative study, eleven cases representing various pulpal and periradicular pathologies were created. These cases were presented to four chatbots (ChatGPT 3.5, ChatGPT 4.0, Bard, and Bing) in both Portuguese and English, with the information order varied (signs and symptoms first or imaging data first). Statistical analyses included the Kruskal-Wallis test, Dwass-Steel-Critchlow-Fligner pairwise comparisons, simple logistic regression, and the binomial test.
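The group comparison described above can be illustrated with a minimal sketch in Python using SciPy. The accuracy vectors below are hypothetical placeholders, not the study's data; the study's actual analysis also included Dwass-Steel-Critchlow-Fligner pairwise comparisons and logistic regression, which are omitted here.

```python
from scipy.stats import kruskal, binomtest

# Hypothetical per-case diagnostic scores (1 = correct, 0 = incorrect)
# for four chatbots across 11 cases -- illustrative values only.
bing  = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1]
gpt4  = [1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]
gpt35 = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1]
bard  = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

# Omnibus Kruskal-Wallis test across the four groups
stat, p = kruskal(bing, gpt4, gpt35, bard)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.4f}")

# Binomial test: does one chatbot exceed a 50% chance baseline?
result = binomtest(sum(bing), n=len(bing), p=0.5, alternative="greater")
print(f"binomial test p = {result.pvalue:.4f}")
```

A significant Kruskal-Wallis result would then justify pairwise follow-up comparisons, as the study performed.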

RESULTS

Bing and ChatGPT 4.0 achieved the highest diagnostic accuracy rates (86.4 % and 85.3 % respectively), significantly outperforming ChatGPT 3.5 (46.5 %) and Bard (28.6 %) (p < 0.001). For treatment recommendations, ChatGPT 4.0, Bing, and ChatGPT 3.5 performed similarly (94.4 %, 93.2 %, and 86.3 %, respectively), while Bard exhibited significantly lower accuracy (75 %, p < 0.001). No significant association between diagnosis and treatment accuracy was found for Bard and Bing, but a positive association was observed for ChatGPT 3.5 and ChatGPT 4.0 (p < 0.05). The overall consistency rate was 98.29 %, with no significant differences related to text order or language. Cases presented in Portuguese prompted significantly more additional information requests than those in English (33.5 % vs. 10.2 %; p < 0.001), with the relevance of this information being higher in Portuguese (29.5 % vs. 8.5 %; p < 0.001).

CONCLUSIONS

Bing and ChatGPT 4.0 demonstrated superior diagnostic accuracy, while Bard showed the lowest accuracy in both diagnosis and treatment recommendations. However, the clinical application of these tools necessitates critical interpretation by dentists, as chatbot responses are not consistently reliable.


Similar articles

1
Comparative accuracy of artificial intelligence chatbots in pulpal and periradicular diagnosis: A cross-sectional study.
Comput Biol Med. 2024 Dec;183:109332. doi: 10.1016/j.compbiomed.2024.109332. Epub 2024 Oct 30.
2
Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted From Patient Brochures.
Cureus. 2024 Mar 23;16(3):e56766. doi: 10.7759/cureus.56766. eCollection 2024 Mar.
3
Quantitative Comparison of Chatbots on Common Rhinology Pathologies.
Laryngoscope. 2024 Oct;134(10):4225-4231. doi: 10.1002/lary.31470. Epub 2024 Apr 26.
4
Performance of AI-powered chatbots in diagnosing acute pulmonary thromboembolism from given clinical vignettes.
Acute Med. 2024;23(2):66-74.
5
Comparative analysis of artificial intelligence chatbot recommendations for urolithiasis management: A study of EAU guideline compliance.
Fr J Urol. 2024 Jul;34(7-8):102666. doi: 10.1016/j.fjurol.2024.102666. Epub 2024 Jun 5.
6
Performance of Large Language Models (ChatGPT, Bing Search, and Google Bard) in Solving Case Vignettes in Physiology.
Cureus. 2023 Aug 4;15(8):e42972. doi: 10.7759/cureus.42972. eCollection 2023 Aug.
7
Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study.
J Med Internet Res. 2024 Nov 4;26:e60291. doi: 10.2196/60291.
8
Accuracy and Readability of Artificial Intelligence Chatbot Responses to Vasectomy-Related Questions: Public Beware.
Cureus. 2024 Aug 28;16(8):e67996. doi: 10.7759/cureus.67996. eCollection 2024 Aug.
9
Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics.
Int Endod J. 2024 Mar;57(3):305-314. doi: 10.1111/iej.14014. Epub 2023 Dec 20.
10
Evaluating the Sensitivity, Specificity, and Accuracy of ChatGPT-3.5, ChatGPT-4, Bing AI, and Bard Against Conventional Drug-Drug Interactions Clinical Tools.
Drug Healthc Patient Saf. 2023 Sep 20;15:137-147. doi: 10.2147/DHPS.S425858. eCollection 2023.

Cited by

1
Large language models underperform in European general surgery board examinations: a comparative study with experts and surgical residents.
BMC Med Educ. 2025 Aug 23;25(1):1193. doi: 10.1186/s12909-025-07856-7.
2
Comparative Performance of Chatbots in Endodontic Clinical Decision Support: A 4-Day Accuracy and Consistency Study.
Int Dent J. 2025 Jul 27;75(5):100920. doi: 10.1016/j.identj.2025.100920.
3
Annotation of biological samples data to standard ontologies with support from large language models.
Comput Struct Biotechnol J. 2025 May 26;27:2155-2167. doi: 10.1016/j.csbj.2025.05.020. eCollection 2025.
4
Comparing diagnostic skills in endodontic cases: dental students versus ChatGPT-4o.
BMC Oral Health. 2025 Mar 29;25(1):457. doi: 10.1186/s12903-025-05857-y.
5
Evaluating Large Language Models for Burning Mouth Syndrome Diagnosis.
J Pain Res. 2025 Mar 19;18:1387-1405. doi: 10.2147/JPR.S509845. eCollection 2025.
6
The Transformative Role of Artificial Intelligence in Dentistry: A Comprehensive Overview. Part 1: Fundamentals of AI, and its Contemporary Applications in Dentistry.
Int Dent J. 2025 Apr;75(2):383-396. doi: 10.1016/j.identj.2025.02.005. Epub 2025 Mar 11.