Evaluation of Large Language Model Performance in Answering Clinical Questions on Periodontal Furcation Defect Management

Authors

Chatzopoulos Georgios S, Koidou Vasiliki P, Tsalikis Lazaros, Kaklamanos Eleftherios G

Affiliations

Department of Preventive Dentistry, Periodontology and Implant Biology, School of Dentistry, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece.

Division of Periodontology, Department of Developmental and Surgical Sciences, School of Dentistry, University of Minnesota, Minneapolis, MN 55455, USA.

Publication information

Dent J (Basel). 2025 Jun 18;13(6):271. doi: 10.3390/dj13060271.

DOI:10.3390/dj13060271

PMID:40559174

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12191798/

Abstract

Background/Objectives: Large Language Models (LLMs) are artificial intelligence (AI) systems with the capacity to process vast amounts of text and generate human-like language, offering the potential for improved information retrieval in healthcare. This study aimed to assess and compare the evidence-based potential of answers provided by four LLMs to common clinical questions concerning the management and treatment of periodontal furcation defects.

Methods: Four LLMs (ChatGPT 4.0, Google Gemini, Google Gemini Advanced, and Microsoft Copilot) were used to answer ten clinical questions related to periodontal furcation defects. The LLM-generated responses were compared against a "gold standard" derived from the European Federation of Periodontology (EFP) S3 guidelines and recent systematic reviews. Two board-certified periodontists independently evaluated the answers for comprehensiveness, scientific accuracy, clarity, and relevance using a predefined rubric and a scoring system of 0-10.

Results: LLM performance varied across the evaluation criteria. Google Gemini Advanced generally achieved the highest average scores, particularly in comprehensiveness and clarity, while Google Gemini and Microsoft Copilot tended to score lower, especially in relevance. However, the Kruskal-Wallis test revealed no statistically significant differences in the overall average scores among the LLMs. Inter-evaluator agreement and intra-evaluator reliability were high.

Conclusions: While LLMs demonstrate the potential to answer clinical questions related to furcation defect management, their performance varies. The LLMs showed differing degrees of comprehensiveness, scientific accuracy, clarity, and relevance. Dental professionals should be aware of LLMs' capabilities and limitations when seeking clinical information.
