Tussie Camila, Starosta Abraham
Harvard School of Dental Medicine, Boston, Massachusetts, USA.
Br Dent J. 2024 Oct 31. doi: 10.1038/s41415-024-8015-2.
Introduction With the advancement of artificial intelligence, large language models (LLMs) have emerged as a technology that can generate human-like text across various domains. They hold vast potential in the dental field, where they could be integrated into clinical dentistry, administrative dentistry, and student and patient education. However, the successful integration of LLMs into dentistry relies on the dental knowledge of the models used, as inaccuracies can lead to significant risks in patient care and education.
Aims We are the first to compare the dental knowledge of different LLMs by testing the accuracy of their responses to Integrated National Board Dental Examination (INBDE) questions.
Methods We included closed-source and open-source models and analysed their responses to both 'patient box' style board questions and more traditional, text-based, multiple-choice questions.
Results For the entire INBDE question bank, ChatGPT-4 had the highest dental knowledge, with an accuracy of 75.88%, followed by Claude-2.1 with 66.38% and then Mistral-Medium with 54.77%. There was a statistically significant difference in performance across all models.
Conclusion Our results highlight the high potential of LLM integration into the dental field, the importance of the choice of LLM when developing new technologies, and the limitations that must be overcome before unsupervised clinical integration can be adopted.