Martina Stefano, Cannatà Davide, Paduano Teresa, Schettino Valentina, Giordano Francesco, Galdi Marzio
Department of Medicine, Surgery and Dentistry "Scuola Medica Salernitana", University of Salerno, Via Allende, 84081 Baronissi, Italy.
Dent J (Basel). 2025 Jul 24;13(8):343. doi: 10.3390/dj13080343.
Objectives: The present cross-sectional analysis aimed to investigate whether Large Language Model-based chatbots can be used as reliable sources of information in orthodontics by evaluating chatbot responses and comparing them to those of dental practitioners with different levels of knowledge. Methods: Eight true-or-false frequently asked orthodontic questions were submitted to five leading chatbots (ChatGPT-4, Claude-3-Opus, Gemini 2.0 Flash Experimental, Microsoft Copilot, and DeepSeek). The consistency of the answers given by the chatbots at four different times was assessed using Cronbach's α. The chi-squared test was used to compare chatbot responses with those given by two groups of clinicians, i.e., general dental practitioners (GDPs) and orthodontic specialists (Os), recruited in an online survey via social media; differences were considered significant at p < 0.05. Additionally, the chatbots were asked to provide a justification for their dichotomous responses using a chain-of-thought prompting approach, and the educational value of these justifications was rated according to the Global Quality Scale (GQS). Results: A high degree of consistency in answering was found for all analyzed chatbots (α > 0.80). When comparing chatbot answers with GDP and O ones, statistically significant differences were found for almost all the questions (p < 0.05). When evaluating the educational value of chatbot responses, DeepSeek achieved the highest GQS score (median 4.00; interquartile range 0.00), whereas Copilot had the lowest (median 2.00; interquartile range 2.00). Conclusions: Although chatbots yield somewhat useful information about orthodontics, they can provide misleading information when dealing with controversial topics.