Department of Pediatric Dentistry, University of Alabama at Birmingham, Birmingham, AL, USA.
J Dent. 2024 May;144:104938. doi: 10.1016/j.jdent.2024.104938. Epub 2024 Apr 3.
Artificial intelligence (AI) applications such as large language models (LLMs) can simulate human-like conversation. The potential of LLMs in healthcare has not been fully evaluated. This pilot study assessed the accuracy and consistency of chatbots and clinicians in answering common questions in pediatric dentistry.
Two expert pediatric dentists developed thirty true-or-false questions covering different aspects of pediatric dentistry. Publicly accessible chatbots (Google Bard, ChatGPT-4, ChatGPT-3.5, Llama, Sage, Claude 2 100k, Claude-instant, Claude-instant-100k, and Google Palm) were used to answer the questions in three independent new conversations each. Three groups of clinicians (general dentists, pediatric dentistry specialists, and students; n = 20 per group) also answered. Responses were graded by two pediatric dentistry faculty members and a third independent pediatric dentist. The resulting accuracies (percentage of correct responses) were compared using analysis of variance (ANOVA), with post-hoc pairwise group comparisons corrected by Tukey's HSD method. Cronbach's alpha was calculated to assess consistency.
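To make the statistical pipeline concrete, here is a minimal Python sketch of the analyses named above: one-way ANOVA across groups, Tukey's HSD post-hoc correction, and Cronbach's alpha across repeated chatbot conversations. This is not the study's code; the simulated scores, the ~10% run-to-run disagreement rate, and the cronbach_alpha helper are illustrative assumptions, with only the group means, SDs, and n = 20 per group taken from this abstract.

```python
# Illustrative sketch only: simulated data shaped like the abstract's summary statistics.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical per-respondent accuracies (% correct on 30 questions), n = 20 per group.
pediatric = rng.normal(96.7, 4.3, 20).clip(0, 100)
general = rng.normal(88.0, 6.1, 20).clip(0, 100)
students = rng.normal(80.8, 6.9, 20).clip(0, 100)

# One-way ANOVA across the three clinician groups.
f_stat, p_val = stats.f_oneway(pediatric, general, students)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_val:.4g}")

# Tukey's HSD corrects the post-hoc pairwise group comparisons.
scores = np.concatenate([pediatric, general, students])
groups = ["pediatric"] * 20 + ["general"] * 20 + ["student"] * 20
print(pairwise_tukeyhsd(scores, groups))

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (cases x items) score matrix;
    here, cases = 30 questions and items = 3 conversations."""
    items = np.asarray(items, dtype=float)
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    k = items.shape[1]
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Three correlated chatbot conversations scored on 30 items (1 = correct);
# the ~10% flip rate is an assumption. Alpha > 0.7 is the conventional threshold.
base = rng.integers(0, 2, size=30)
flips = rng.random((3, 30)) < 0.1
runs = np.where(flips, 1 - base, base)
print(f"Cronbach's alpha: {cronbach_alpha(runs.T):.2f}")
```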
Pediatric dentists were significantly more accurate (mean ± SD: 96.67% ± 4.3%) than the other clinicians and the chatbots (p < 0.001). General dentists (88.0% ± 6.1%) also demonstrated significantly higher accuracy than the chatbots (p < 0.001), followed by students (80.8% ± 6.9%). ChatGPT showed the highest accuracy among the chatbots (78% ± 3%). All chatbots except ChatGPT-3.5 showed acceptable consistency (Cronbach's alpha > 0.7).
Based on this pilot study, chatbots may be valuable adjuncts for educational purposes and for distributing information to patients. However, they are not yet ready to serve as substitutes for human clinicians in diagnostic decision-making.
In this pilot study, chatbots showed lower accuracy than dentists and are therefore not yet recommended for clinical use in pediatric dentistry.