Albagieh Hamad, Alzeer Zaid O, Alasmari Osama N, Alkadhi Abdullah A, Naitah Abdulaziz N, Almasaad Khaled F, Alshahrani Turki S, Alshahrani Khalid S, Almahmoud Mohammed I
Oral Medicine, King Saud University, Riyadh, SAU.
Dentistry, College of Dentistry, King Saud University, Riyadh, SAU.
Cureus. 2024 Jan 3;16(1):e51584. doi: 10.7759/cureus.51584. eCollection 2024 Jan.
INTRODUCTION: Artificial intelligence (AI) is a field of computer science that seeks to build intelligent machines capable of carrying out tasks that usually require human intelligence. AI may help dentists with a variety of dental tasks, including clinical diagnosis and treatment planning. This study aims to compare the performance of AI and oral medicine residents in diagnosing different cases and providing treatment, and to determine whether AI is reliable enough to assist them in their field of work.

METHODS: The study conducted a comparative analysis of the responses of third- and fourth-year residents trained in Oral Medicine and Pathology at King Saud University, College of Dentistry. The residents were given a closed multiple-choice test consisting of 19 questions with four response options labeled A-D and one question with five response options labeled A-E. The test was administered via Google Forms, and each resident's response was stored electronically in an Excel sheet (Microsoft® Corp., Redmond, WA). The residents' answers were then compared with the responses generated by three major language models: OpenAI, Stablediffusion, and PopAI. The questions were entered into the language models in the same format as the original test, and before each question a new chat session was created to eliminate memory-retention bias. The input was done on November 19, 2023, the same day the official multiple-choice test was administered. The sample consisted of 20 third- and fourth-year residents trained in Oral Medicine and Pathology at King Saud University, College of Dentistry.

RESULTS: The analysis covered the responses of three large language models (LLMs), namely OpenAI, Stablediffusion, and PopAI, and the responses of 20 senior residents to 20 clinical cases on oral lesion diagnosis. Significant variations were observed in the responses to only two questions (10%); for the remaining questions, there were no significant differences. The median (IQR) score of the LLMs was 50.0 (45.0-60.0), with a minimum of 40 (Stablediffusion) and a maximum of 70 (OpenAI). The median (IQR) score of the senior residents was 65.0 (55.0-75.0); their lowest and highest scores were 40 and 90, respectively. There was no significant difference in the percent scores of residents and LLMs (p = 0.211). Agreement was measured using the Kappa value. Agreement among the senior dental residents was weak, with a Kappa value of 0.396. In contrast, agreement among the LLMs was moderate, with a Kappa value of 0.622, suggesting a more cohesive alignment in responses among the artificial intelligence models. When residents' responses were compared with those generated by the individual LLMs (OpenAI, Stablediffusion, and PopAI), the agreement levels were consistently weak, with Kappa values of 0.402, 0.381, and 0.392, respectively.

CONCLUSION: The current study found no significant difference in response scores between residents and LLMs, whereas the agreement analysis showed low agreement among the residents and high agreement among the LLMs. Dentists should consider that AI can be very beneficial in providing diagnoses and treatment and may use it to assist them.
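The score comparison and agreement analysis summarized above can be illustrated on comparable data. The sketch below is a minimal, hypothetical example in Python: it assumes answer matrices `resident_answers` and `llm_answers` (raters × questions, options coded 0-4) and an `answer_key` that stand in for the study's actual responses, and it assumes a Fleiss-style multi-rater kappa and a Mann-Whitney U test as plausible, but unconfirmed, choices for the statistics reported in the abstract.

```python
# Minimal sketch (not the study's actual analysis code): multi-rater agreement
# via Fleiss' kappa and a nonparametric comparison of percent scores, using
# randomly generated placeholder data in place of the real responses.
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Hypothetical data: rows = raters, columns = 20 questions, values = chosen
# option coded 0..4 (A..E). The real study data are not part of this record.
resident_answers = rng.integers(0, 5, size=(20, 20))  # 20 residents
llm_answers = rng.integers(0, 5, size=(3, 20))        # 3 LLMs
answer_key = rng.integers(0, 5, size=20)              # hypothetical key

def percent_scores(answers: np.ndarray, key: np.ndarray) -> np.ndarray:
    """Percent of the 20 questions each rater answered correctly."""
    return (answers == key).mean(axis=1) * 100

def agreement_kappa(answers: np.ndarray) -> float:
    """Fleiss' kappa across raters; aggregate_raters expects subjects x raters."""
    table, _ = aggregate_raters(answers.T)  # transpose so questions are subjects
    return fleiss_kappa(table, method="fleiss")

res_scores = percent_scores(resident_answers, answer_key)
llm_scores = percent_scores(llm_answers, answer_key)

# Nonparametric comparison of resident vs. LLM percent scores
# (the abstract reports p = 0.211 for this comparison).
stat, p_value = mannwhitneyu(res_scores, llm_scores, alternative="two-sided")

print(f"Residents: median score = {np.median(res_scores):.1f}")
print(f"LLMs:      median score = {np.median(llm_scores):.1f}")
print(f"Mann-Whitney U p-value  = {p_value:.3f}")
print(f"Kappa (residents) = {agreement_kappa(resident_answers):.3f}")
print(f"Kappa (LLMs)      = {agreement_kappa(llm_answers):.3f}")
```

With the study's actual answer matrices in place of the random placeholders, the same workflow would reproduce the reported medians, the p-value for the score comparison, and the within-group Kappa values.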