
Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR).

Author information

Forero Diego A, Abreu Sandra E, Tovar Blanca E, Oermann Marilyn H

Affiliations

School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia.

Psychology Program, Fundación Universitaria del Área Andina, Medellín, Colombia.

Publication information

J Med Syst. 2025 Jun 12;49(1):80. doi: 10.1007/s10916-025-02212-0.

Abstract

In the context of Evidence-Based Practice (EBP), Systematic Reviews (SRs), Meta-Analyses (MAs), and overviews of reviews have become cornerstones for the synthesis of research findings. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 and the Preferred Reporting Items for Overviews of Reviews (PRIOR) statements have become the major reporting guidelines for SRs/MAs and for overviews of reviews, respectively. In recent years, advances in Generative Artificial Intelligence (genAI) have been proposed as a potential major paradigm shift in scientific research. The main aim of this research was to examine the performance of four Large Language Models (LLMs) in analyzing adherence to PRISMA 2020 and PRIOR in a sample of 20 SRs and 20 overviews of reviews. We tested the free versions of four commonly used LLMs: ChatGPT (GPT-4o), DeepSeek (V3), Gemini (2.0 Flash), and Qwen (2.5 Max). Adherence to PRISMA 2020 and PRIOR was compared with scores previously assigned by human experts, using several statistical tests. In our results, all four LLMs showed low performance in the analysis of adherence to PRISMA 2020, overestimating the percentage of adherence (by 23% to 30%). For PRIOR, the LLMs showed smaller differences in the estimation of adherence (6% to 14%), and ChatGPT performed similarly to the human experts. This is the first report on the performance of four commonly used LLMs for the analysis of adherence to PRISMA 2020 and PRIOR. Future studies of adherence to other reporting guidelines will be helpful in health sciences research.
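
The abstract does not name the specific statistical tests used. As a minimal sketch, assuming per-review adherence percentages from one LLM and from the human experts, a paired non-parametric comparison such as the Wilcoxon signed-rank test could look like the following; all variable names and numbers are illustrative, not the study's data:

from scipy.stats import wilcoxon

# Illustrative adherence percentages for five reviews
# (invented numbers, not the study's data).
human_scores = [62.0, 70.0, 55.0, 48.0, 66.0]  # human expert ratings
llm_scores = [88.0, 92.0, 80.0, 75.0, 90.0]    # one LLM's estimates

# Paired, non-parametric comparison of the two sets of scores.
stat, p_value = wilcoxon(human_scores, llm_scores)

# Mean overestimation by the LLM, in percentage points.
mean_diff = sum(l - h for l, h in zip(llm_scores, human_scores)) / len(human_scores)

print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.3f}")
print(f"Mean overestimation: {mean_diff:.1f} percentage points")

A paired design fits this setting because each review receives two scores for the same checklist, one from the experts and one from the LLM.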

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/399f/12162794/b6f997a128ab/10916_2025_2212_Fig1_HTML.jpg
