Russe Maximilian Frederik, Rau Alexander, Ermer Michael Andreas, Rothweiler René, Wenger Sina, Klöble Klara, Schulze Ralf K W, Bamberg Fabian, Schmelzeisen Rainer, Reisert Marco, Semper-Hogg Wiebke
Department of Diagnostic and Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg 79106, Germany.
Department of Neuroradiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg 79106, Germany.
Dentomaxillofac Radiol. 2024 Feb 8;53(2):109-114. doi: 10.1093/dmfr/twad015.
To develop content-aware chatbots based on GPT-3.5-Turbo and GPT-4 with specialized knowledge of the German S2 Cone-Beam CT (CBCT) dental imaging guideline and to compare their performance with that of human practitioners.
The LlamaIndex software library was used to integrate the guideline context into the chatbots. Based on the CBCT S2 guideline, 40 questions were posed to the content-aware chatbots; answers from early-career and senior practitioners with different levels of experience served as the reference. The chatbots' performance was compared in terms of recommendation accuracy and explanation quality. The chi-square test and the one-tailed Wilcoxon signed-rank test were used to evaluate accuracy and explanation quality, respectively.
The GPT-4-based chatbot provided 100% correct recommendations and superior explanation quality compared with the GPT-3.5-Turbo-based chatbot (87.5% vs 57.5%; P = .003). Moreover, it outperformed the early-career practitioners in correct answers (P = .002 and P = .032) and earned higher trust than the GPT-3.5-Turbo-based chatbot (P = .006).
A content-aware chatbot using GPT-4 reliably provided recommendations in accordance with current consensus guidelines. Its responses were deemed trustworthy and transparent, which may facilitate the integration of artificial intelligence into clinical decision-making.