Koidou Vasiliki P, Chatzopoulos Georgios S, Tsalikis Lazaros, Kaklamanos Eleutherios G
Research Associate, Centre for Oral Immunobiology and Regenerative Medicine and Centre for Oral Clinical Research, Institute of Dentistry, Queen Mary University of London (QMUL), London, England, UK.
PhD candidate, Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece; and Visiting Research Assistant Professor, Division of Periodontology, Department of Developmental and Surgical Sciences, School of Dentistry, University of Minnesota, Minneapolis, Minn.
J Prosthet Dent. 2025 Mar 6. doi: 10.1016/j.prosdent.2025.02.008.
Artificial intelligence (AI) has gained significant attention recently, and several AI applications, such as large language models (LLMs), show promise for use in clinical medicine and dentistry. Nevertheless, assessing the performance of LLMs is essential to identify potential inaccuracies and prevent harmful outcomes.
The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to clinical questions in the field of implant dentistry.
A total of 10 open-ended questions pertinent to the prevention and treatment of peri-implant disease were posed to 4 distinct LLMs: ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers were evaluated independently by 2 periodontists against the scientific evidence for comprehensiveness, scientific accuracy, clarity, and relevance. The LLMs' responses received scores ranging from 0 (minimum) to 10 (maximum) points. To assess intra-evaluator reliability, the LLM responses were re-evaluated after 2 weeks, and the Cronbach α and intraclass correlation coefficient (ICC) were used (α=.05).
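The intra-evaluator reliability check described above can be illustrated with a minimal sketch. The scores below are hypothetical (the study's raw data are not given here); the sketch computes Cronbach's α across an evaluator's two rating occasions using the standard formula α = k/(k−1) · (1 − Σ item variances / variance of totals).

```python
# Hedged illustration only: hypothetical 0-10 scores one evaluator gave
# the 10 LLM answers at baseline and at the 2-week re-evaluation.
# These are NOT the study's data.
from statistics import variance

occasion_1 = [8, 7, 9, 6, 8, 7, 9, 8, 6, 7]
occasion_2 = [8, 6, 9, 7, 8, 7, 8, 8, 6, 7]

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(items)
    item_vars = sum(variance(item) for item in items)
    totals = [sum(vals) for vals in zip(*items)]  # per-answer total across occasions
    return k / (k - 1) * (1 - item_vars / variance(totals))

alpha = cronbach_alpha([occasion_1, occasion_2])
print(f"Cronbach alpha = {alpha:.3f}")  # values near 1 indicate consistent re-scoring
```

A value close to 1 would indicate that the evaluator scored the same answers consistently on the two occasions; the ICC reported in the study addresses the same question with a variance-components model rather than this item-covariance formula.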
The scores assigned by the examiners on the 2 occasions were not statistically different, so an average score was computed for each LLM. Google Gemini Advanced ranked higher than the other LLMs, while Google Gemini scored worst; the difference between Google Gemini Advanced and Google Gemini was statistically significant (P=.005).
Dental professionals need to be cautious when using LLMs to access content related to peri-implant diseases. LLMs cannot currently replace dental professionals, and caution should be exercised when they are used in patient care.