Chatzopoulos Georgios S, Koidou Vasiliki P, Tsalikis Lazaros, Kaklamanos Eleftherios G
PhD candidate, Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece; and Visiting Research Assistant Professor, Division of Periodontology, Department of Developmental and Surgical Sciences, School of Dentistry, University of Minnesota, Minneapolis, Minn.
Research Assistant, Centre for Oral Immunobiology and Regenerative Medicine and Centre for Oral Clinical Research, Institute of Dentistry, Queen Mary University of London (QMUL), London, England, UK.
J Prosthet Dent. 2024 Nov 18. doi: 10.1016/j.prosdent.2024.10.020.
Although the use of artificial intelligence (AI) seems promising and may assist dentists in clinical practice, the potential consequences of inaccurate or even harmful responses are serious. Research is required to examine whether large language models (LLMs) can be used to access periodontal content reliably.
The purpose of this study was to evaluate and compare the evidence-based potential of answers provided by 4 LLMs to common clinical questions in the field of periodontology.
A total of 10 open-ended questions pertinent to periodontology were posed to 4 distinct LLMs: ChatGPT (model GPT-4.0), Google Gemini, Google Gemini Advanced, and Microsoft Copilot. The answers to each question were evaluated independently by 2 periodontists against robust scientific evidence using a predefined rubric assessing comprehensiveness, scientific accuracy, clarity, and relevance. Each response received a score ranging from 0 (minimum) to 10 (maximum). Two weeks after the initial evaluation, the answers were regraded independently to gauge intra-evaluator reliability. Inter-evaluator reliability was assessed using correlation tests, while the Cronbach alpha and the intraclass correlation coefficient were used to measure overall reliability. The Kruskal-Wallis test was employed to compare the scores awarded to the different LLMs.
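The abstract does not include the raw scores or analysis code; the following is a minimal Python sketch, on hypothetical data, of the reliability and comparison workflow described above (Spearman correlation between evaluators, Cronbach alpha, ICC, and Kruskal-Wallis). All data values and the ICC form chosen, ICC(2,1), are assumptions for illustration only.

```python
# Hedged sketch of the reported statistical workflow, using HYPOTHETICAL
# rubric scores (the study's raw data are not published in the abstract).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 0-10 rubric scores: 10 questions x 2 evaluators per model.
models = ["ChatGPT 4.0", "Google Gemini", "Google Gemini Advanced", "Microsoft Copilot"]
scores = {m: rng.integers(5, 11, size=(10, 2)).astype(float) for m in models}

def cronbach_alpha(x):
    """Cronbach alpha across raters (columns) of an items-x-raters matrix."""
    k = x.shape[1]
    item_vars = x.var(axis=0, ddof=1)
    total_var = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater."""
    n, k = x.shape
    grand = x.mean()
    ms_r = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
    ms_c = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters
    ss_e = ((x - x.mean(axis=1, keepdims=True)
               - x.mean(axis=0, keepdims=True) + grand) ** 2).sum()
    ms_e = ss_e / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

all_scores = np.vstack(list(scores.values()))  # 40 answers x 2 evaluators
rho, p = stats.spearmanr(all_scores[:, 0], all_scores[:, 1])
print(f"inter-evaluator Spearman rho={rho:.2f} (P={p:.3f})")
print(f"Cronbach alpha={cronbach_alpha(all_scores):.2f}, ICC(2,1)={icc_2_1(all_scores):.2f}")

# Kruskal-Wallis across the 4 models, on per-question average scores.
groups = [s.mean(axis=1) for s in scores.values()]
h, p_kw = stats.kruskal(*groups)
print(f"Kruskal-Wallis H={h:.2f}, P={p_kw:.3f}")
```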
The scores provided by the 2 evaluators across both evaluations were statistically similar (P values ranging from .083 to >.999), so an average score was calculated for each LLM. Both evaluators gave the highest scores to the answers generated by ChatGPT 4.0 and the lowest to those from Google Gemini. ChatGPT 4.0 received the highest average score, and a significant difference was detected between ChatGPT 4.0 and Google Gemini (P=.042). ChatGPT 4.0 answers were found to be highly comprehensive, scientifically accurate, clear, and relevant.
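The abstract reports one significant pairwise contrast (ChatGPT 4.0 versus Google Gemini, P=.042) but does not name the post hoc procedure. A common choice after a Kruskal-Wallis test is pairwise Mann-Whitney U tests with a Bonferroni adjustment, sketched here as one plausible approach, continuing from the hypothetical data and imports in the sketch above.

```python
# Hedged follow-up: an ASSUMED post hoc procedure (pairwise Mann-Whitney U
# with Bonferroni correction); the study's actual method is not stated.
from itertools import combinations

pairs = list(combinations(models, 2))
for a, b in pairs:
    u, p_raw = stats.mannwhitneyu(scores[a].mean(axis=1), scores[b].mean(axis=1))
    p_adj = min(1.0, p_raw * len(pairs))  # Bonferroni adjustment
    print(f"{a} vs {b}: U={u:.1f}, adjusted P={p_adj:.3f}")
```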
Professionals need to be aware of the limitations of LLMs when using them. These models must not replace dental professionals, as improper use may negatively affect patient care. ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot all performed relatively well, with ChatGPT 4.0 demonstrating the highest performance.