Large language models for the screening step in systematic reviews in dentistry.

Author Information

Rokhshad Rata, Bagherianlemraski Mobina, Ehsani Sarah Sadat, Haghighat Sara, Schwendicke Falk

Affiliations

Department of Pediatric Dentistry, Loma Linda School of Dentistry, Loma Linda, USA.

Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada; Topic Group Dental Diagnostics and Digital Dentistry, ITU/WHO Focus Group AI on Health, Berlin, Germany.

Publication Information

J Dent. 2025 Sep;160:105877. doi: 10.1016/j.jdent.2025.105877. Epub 2025 Jun 4.

Abstract

OBJECTIVES

This study assessed the performance of chatbots in the screening step of a systematic review (SR), using artificial intelligence (AI)-based tooth segmentation on dental radiographs as the exemplary topic.

METHODS

A comprehensive systematic search was performed in December 2024 across seven databases: PubMed, Scopus, Web of Science, Embase, IEEE, Google Scholar, and arXiv. Five chatbots (ChatGPT-4, Claude 2 100k, Claude Instant 100k, Meta's LLaMA 3, and Gemini) were evaluated for their ability to screen articles on AI-based tooth segmentation on radiographs. The evaluations took place from January to February 2025. Screening quality was measured against expert reviewers' decisions using accuracy, precision, sensitivity, specificity, and F1-score, and inter-rater agreement between the chatbots was quantified with Cohen's Kappa.
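The paper does not publish its evaluation code; the following is a minimal sketch of how such screening metrics could be computed, assuming binary include/exclude decisions encoded as 1/0. The label arrays are hypothetical placeholders, not study data.

```python
# Minimal sketch (not the authors' code): computing the reported screening
# metrics for one chatbot against expert reviewers' include/exclude labels.
# The label arrays below are hypothetical placeholders.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, cohen_kappa_score, confusion_matrix)

expert = [1, 0, 0, 1, 0, 1, 0, 0]   # 1 = include, 0 = exclude (ground truth)
chatbot = [1, 1, 0, 1, 0, 0, 1, 0]  # one chatbot's screening decisions

accuracy = accuracy_score(expert, chatbot)
precision = precision_score(expert, chatbot)   # positive predictive value for "include"
sensitivity = recall_score(expert, chatbot)    # recall for "include"
f1 = f1_score(expert, chatbot)

# Specificity = TN / (TN + FP), taken from the binary confusion matrix.
tn, fp, fn, tp = confusion_matrix(expert, chatbot).ravel()
specificity = tn / (tn + fp)

# Cohen's Kappa for pairwise agreement, e.g. between two chatbots.
other_chatbot = [0, 1, 0, 1, 1, 0, 1, 0]
kappa = cohen_kappa_score(chatbot, other_chatbot)

print(f"acc={accuracy:.2f} prec={precision:.2f} sens={sensitivity:.2f} "
      f"spec={specificity:.2f} f1={f1:.2f} kappa={kappa:.2f}")
```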

RESULTS

A total of 891 studies were screened. Significant variability in the number of included or excluded studies was observed (p < 0.001, chi-square test), with Claude-instant-100k having the highest inclusion rate (54.88%) and ChatGPT-4 the lowest (29.52%). Gemini excluded the most studies (67.90%), while ChatGPT-4 marked the highest proportion of studies for full-text review (5.39%). A Fleiss' Kappa of -0.147 (p < 0.001) indicated systematic disagreement among the chatbots, i.e., agreement worse than chance. Performance metrics varied: ChatGPT-4 had the highest precision (24%) and accuracy (75%) measured against human expert reviewers, while Claude-instant-100k had the highest sensitivity (96%) but the lowest precision (16%).
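Fleiss' Kappa generalizes Cohen's Kappa to more than two raters; values below zero indicate agreement worse than expected by chance, as reported here. A minimal sketch of the computation, assuming each of the five chatbots assigns one of three decision categories per study (the decision matrix is hypothetical, not study data):

```python
# Minimal sketch (assumed workflow, not the authors' code): Fleiss' Kappa
# across five raters (chatbots) screening the same set of studies.
# Decisions are hypothetical: 0 = exclude, 1 = include, 2 = full-text review.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = studies, columns = the five chatbots' decisions.
decisions = np.array([
    [1, 0, 2, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 2, 1],
    [0, 1, 1, 0, 0],
])

# aggregate_raters converts rater-level decisions into per-study category counts,
# the input format fleiss_kappa expects.
table, _categories = aggregate_raters(decisions)
kappa = fleiss_kappa(table, method='fleiss')
print(f"Fleiss' Kappa = {kappa:.3f}")  # negative values: worse-than-chance agreement
```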

CONCLUSION

Chatbots showed limited accuracy and low inter-rater agreement during study screening. Human oversight remains necessary during systematic reviews.

CLINICAL SIGNIFICANCE

In theory, chatbots can streamline SR tasks such as screening. However, human oversight remains critical to maintaining the integrity of the review.

