Postgraduate Program in Clinical Dentistry, University Center of Pará (CESUPA), Belém, Pará, Brazil.
Postgraduate Program in Health Sciences, Center for Health Sciences, Pontifical Catholic University of Campinas (PUC-Campinas), Campinas, São Paulo, Brazil.
Comput Biol Med. 2024 Dec;183:109332. doi: 10.1016/j.compbiomed.2024.109332. Epub 2024 Oct 30.
This study aimed to evaluate the diagnostic accuracy and treatment recommendation performance of four artificial intelligence chatbots in fictional pulpal and periradicular disease cases. Additionally, it investigated response consistency and the influence of text order and language on chatbot performance.
In this cross-sectional comparative study, eleven cases representing various pulpal and periradicular pathologies were created. These cases were presented to four chatbots (ChatGPT 3.5, ChatGPT 4.0, Bard, and Bing) in both Portuguese and English, with the information order varied (signs and symptoms first or imaging data first). Statistical analyses included the Kruskal-Wallis test, Dwass-Steel-Critchlow-Fligner pairwise comparisons, simple logistic regression, and the binomial test.
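For illustration only, the sketch below shows how the statistical comparisons named above (Kruskal-Wallis, Dwass-Steel-Critchlow-Fligner pairwise comparisons, simple logistic regression, and the binomial test) could be run in Python with scipy, scikit-posthocs, and statsmodels. All scores, counts, and library choices are assumptions for demonstration, not the study's actual data or code.

```python
# Minimal sketch, assuming toy per-case accuracy scores (1 = correct, 0 = incorrect)
# for 44 case presentations per chatbot (11 cases x 2 languages x 2 information orders).
import numpy as np
from scipy import stats
import scikit_posthocs as sp      # assumed available; provides posthoc_dscf
import statsmodels.api as sm

scores = {
    "ChatGPT 3.5": np.array([1] * 20 + [0] * 24),   # hypothetical scores
    "ChatGPT 4.0": np.array([1] * 38 + [0] * 6),
    "Bard":        np.array([1] * 13 + [0] * 31),
    "Bing":        np.array([1] * 38 + [0] * 6),
}

# Kruskal-Wallis test: do accuracy distributions differ across the four chatbots?
h, p = stats.kruskal(*scores.values())
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p:.4g}")

# Dwass-Steel-Critchlow-Fligner all-pairs comparisons.
print(sp.posthoc_dscf(list(scores.values())))

# Simple logistic regression: does a correct diagnosis predict a correct
# treatment recommendation? (toy vectors constructed to avoid separation)
diagnosis = np.array([1] * 30 + [0] * 14)
treatment = np.array([1] * 27 + [0] * 3 + [1] * 8 + [0] * 6)
logit = sm.Logit(treatment, sm.add_constant(diagnosis)).fit(disp=False)
print(logit.params)

# Binomial test: is the response-consistency rate above a 50% reference level?
# (115/117 is a hypothetical count approximating a 98.29% rate)
print(stats.binomtest(115, n=117, p=0.5))
```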
Bing and ChatGPT 4.0 achieved the highest diagnostic accuracy rates (86.4 % and 85.3 %, respectively), significantly outperforming ChatGPT 3.5 (46.5 %) and Bard (28.6 %) (p < 0.001). For treatment recommendations, ChatGPT 4.0, Bing, and ChatGPT 3.5 performed similarly (94.4 %, 93.2 %, and 86.3 %, respectively), while Bard exhibited significantly lower accuracy (75 %, p < 0.001). No significant association between diagnostic and treatment accuracy was found for Bard and Bing, but a positive association was observed for ChatGPT 3.5 and ChatGPT 4.0 (p < 0.05). The overall consistency rate was 98.29 %, with no significant differences related to text order or language. Cases presented in Portuguese prompted significantly more requests for additional information than those in English (33.5 % vs. 10.2 %; p < 0.001), and the requested information was more often relevant in Portuguese (29.5 % vs. 8.5 %; p < 0.001).
Bing and ChatGPT 4.0 demonstrated superior diagnostic accuracy, while Bard showed the lowest accuracy in both diagnosis and treatment recommendations. However, the clinical application of these tools necessitates critical interpretation by dentists, as chatbot responses are not consistently reliable.