Evaluating the evidence-based potential of six large language models in paediatric dentistry: a comparative study on generative artificial intelligence.

Authors

Dermata Anastasia, Arhakis Aristidis, Makrygiannakis Miltiadis A, Giannakopoulos Kostis, Kaklamanos Eleftherios G

Affiliations

School of Dentistry, Aristotle University of Thessaloniki, 54124, Thessaloniki, Greece.

School of Dentistry, National and Kapodistrian University of Athens, 11527, Athens, Greece.

Publication information

Eur Arch Paediatr Dent. 2025 Jun;26(3):527-535. doi: 10.1007/s40368-025-01012-x. Epub 2025 Feb 22.

Abstract

PURPOSE

The use of large language models (LLMs) in generative artificial intelligence (AI) is rapidly increasing in dentistry. However, their reliability has yet to be fully established. This study aims to evaluate the diagnostic accuracy, clinical applicability, and patient education potential of LLMs in paediatric dentistry by assessing the responses of six LLMs: Google AI's Gemini and Gemini Advanced; OpenAI's ChatGPT-3.5, ChatGPT-4o, and ChatGPT-4; and Microsoft's Copilot.

METHODS

Ten open-ended clinical questions relevant to paediatric dentistry were posed to the LLMs. Two independent evaluators graded the responses from 0 to 10 using a detailed rubric. After 4 weeks, the answers were re-evaluated to assess intra-evaluator reliability. Statistical comparisons used the Friedman, Wilcoxon, and Kruskal-Wallis tests to identify the model that provided the most comprehensive, accurate, explicit, and relevant answers.
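As an illustration of the kind of comparison the Friedman test performs, the following is a minimal sketch in pure Python. It ranks the models within each question and computes the chi-square statistic without a tie correction. The function names and the `demo_scores` values are hypothetical and are not the study's data or code; the abstract does not state which software the authors used.

```python
def average_ranks(values):
    """Rank values ascending (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Friedman chi-square statistic (no tie correction).

    scores[q][m] is the score of model m on question q.
    """
    n = len(scores)       # number of questions (blocks)
    k = len(scores[0])    # number of models (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for m, r in enumerate(average_ranks(row)):
            rank_sums[m] += r
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical scores: 4 questions (rows) x 3 models (columns).
demo_scores = [
    [8, 7, 5],
    [9, 8, 6],
    [7, 8, 5],
    [8, 7, 6],
]
print(friedman_statistic(demo_scores))  # prints 6.5
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; a significant result would typically be followed by pairwise tests such as Wilcoxon's, which is consistent with the tests named in this section.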

RESULTS

Results varied across models. ChatGPT-4 answers received the highest average score (8.08), followed by Gemini Advanced (8.06), ChatGPT-4o (8.01), ChatGPT-3.5 (7.61), Gemini (7.32), and Copilot (5.41). Statistical analysis showed that ChatGPT-4 outperformed all other LLMs, and the difference was statistically significant. Despite variations and different responses to the same queries, remarkable similarities were observed. Except for Copilot, all chatbots scored above 6.5 on all queries.

CONCLUSION

This study demonstrates the potential of large language models (LLMs) to support evidence-based paediatric dentistry. Nevertheless, they cannot be regarded as completely trustworthy. Dental professionals should use AI models critically, as supportive tools and not as a substitute for overall scientific knowledge and critical thinking.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/d6ad/12165978/2d6d6c5f4de4/40368_2025_1012_Fig1_HTML.jpg
