Balel Yunus
Department of Oral and Maxillofacial Surgery, Faculty of Dentistry, Sivas Cumhuriyet University, Sivas 58000, Turkiye.
J Stomatol Oral Maxillofac Surg. 2024 Oct 9;126(4):102114. doi: 10.1016/j.jormas.2024.102114.
The purpose of this study is to evaluate the performance of Scholar GPT in answering technical questions in the field of oral and maxillofacial surgery and to conduct a comparative analysis with the results of a previous study that assessed the performance of ChatGPT.
Scholar GPT was accessed via ChatGPT (www.chatgpt.com) on March 20, 2024. A total of 60 technical questions (15 each on impacted teeth, dental implants, temporomandibular joint disorders, and orthognathic surgery) from our previous study were used. Scholar GPT's responses were evaluated using a modified Global Quality Scale (GQS). The questions were randomized before scoring using an online randomizer (www.randomizer.org). A single researcher performed the evaluations at three different times, three weeks apart, with each evaluation preceded by a new randomization. In cases of score discrepancies, a fourth evaluation was conducted to determine the final score.
Scholar GPT performed well across all technical questions, with an average GQS score of 4.48 (SD = 0.93). By comparison, ChatGPT's average GQS score in the previous study was 3.1 (SD = 1.492). The Wilcoxon Signed-Rank Test indicated a significantly higher average score for Scholar GPT than for ChatGPT (Mean Difference = 2.00, SE = 0.163, p < 0.001). The Kruskal-Wallis Test showed no statistically significant differences among the topic groups (χ² = 0.799, df = 3, p = 0.850, ε² = 0.0135).
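The paired comparison reported above can be illustrated with a short sketch. The data below are hypothetical GQS scores (1–5) for six paired questions, not the study's actual ratings; the function computes the Wilcoxon signed-rank statistic W+ (sum of ranks of positive differences, zero differences dropped, ties given average ranks) in plain Python.

```python
# Minimal sketch of a Wilcoxon signed-rank comparison on paired GQS
# scores. All data here are hypothetical illustrations, not the study's.

def wilcoxon_w_plus(x, y):
    """Return W+ (sum of ranks of positive differences y - x), zeros dropped."""
    diffs = [b - a for a, b in zip(x, y) if b - a != 0]
    # Rank by absolute difference; tied values receive the average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return sum(r for r, d in zip(ranks, diffs) if d > 0)

# Hypothetical paired scores for six questions: ChatGPT vs Scholar GPT.
chatgpt_scores = [3, 2, 4, 3, 1, 5]
scholar_scores = [5, 4, 5, 4, 3, 5]
print(wilcoxon_w_plus(chatgpt_scores, scholar_scores))  # → 15.0
```

In practice this statistic and its p-value would come from a statistics package (e.g. `scipy.stats.wilcoxon`); the hand computation is shown only to make the ranking procedure explicit.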
Scholar GPT demonstrated generally high performance on technical questions within oral and maxillofacial surgery and produced more consistent, higher-quality responses than ChatGPT. The findings suggest that GPT models built on academic databases can provide more accurate and reliable information. Additionally, developing a specialized GPT model for oral and maxillofacial surgery could ensure higher quality and consistency in artificial intelligence-generated information.