

Large Language Models and the Analyses of Adherence to Reporting Guidelines in Systematic Reviews and Overviews of Reviews (PRISMA 2020 and PRIOR)

Authors

Forero Diego A, Abreu Sandra E, Tovar Blanca E, Oermann Marilyn H

Affiliations

School of Health and Sport Sciences, Fundación Universitaria del Área Andina, Bogotá, Colombia.

Psychology Program, Fundación Universitaria del Área Andina, Medellín, Colombia.

Publication

J Med Syst. 2025 Jun 12;49(1):80. doi: 10.1007/s10916-025-02212-0.

DOI: 10.1007/s10916-025-02212-0
PMID: 40504403
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12162794/
Abstract

In the context of Evidence-Based Practice (EBP), Systematic Reviews (SRs), Meta-Analyses (MAs) and overview of reviews have become cornerstones for the synthesis of research findings. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 and Preferred Reporting Items for Overviews of Reviews (PRIOR) statements have become major reporting guidelines for SRs/MAs and for overviews of reviews, respectively. In recent years, advances in Generative Artificial Intelligence (genAI) have been proposed as a potential major paradigm shift in scientific research. The main aim of this research was to examine the performance of four LLMs for the analysis of adherence to PRISMA 2020 and PRIOR, in a sample of 20 SRs and 20 overviews of reviews. We tested the free versions of four commonly used LLMs: ChatGPT (GPT-4o), DeepSeek (V3), Gemini (2.0 Flash) and Qwen (2.5 Max). Adherence to PRISMA 2020 and PRIOR was compared with scores defined previously by human experts, using several statistical tests. In our results, all the four LLMs showed a low performance for the analysis of adherence to PRISMA 2020, overestimating the percentage of adherence (from 23 to 30%). For PRIOR, the LLMs presented lower differences in the estimation of adherence (from 6 to 14%) and ChatGPT showed a performance similar to human experts. This is the first report of the performance of four commonly used LLMs for the analysis of adherence to PRISMA 2020 and PRIOR. Future studies of adherence to other reporting guidelines will be helpful in health sciences research.
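As a rough illustration of the comparison the abstract describes, the sketch below computes per-checklist adherence percentages from item-level judgments (an LLM's ratings versus human expert ratings) and the resulting overestimate. The ratings shown are invented for illustration and are not data from the paper.

```python
# Hypothetical sketch: comparing an LLM's PRISMA 2020 adherence ratings
# against human expert ratings. Item ratings here are illustrative only;
# PRISMA 2020 has 27 checklist items, shortened to 10 for the example.
from statistics import mean

# 1 = item judged as adequately reported, 0 = not adequately reported.
human_ratings = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
llm_ratings   = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

def adherence_pct(ratings):
    """Percent of checklist items judged as adequately reported."""
    return 100.0 * mean(ratings)

human_pct = adherence_pct(human_ratings)
llm_pct = adherence_pct(llm_ratings)
overestimate = llm_pct - human_pct
print(f"human: {human_pct:.0f}%  LLM: {llm_pct:.0f}%  difference: {overestimate:+.0f}%")
# → human: 60%  LLM: 90%  difference: +30%
```

In the study itself the per-item agreement was then compared across 20 SRs and 20 overviews of reviews with statistical tests; this sketch only shows how an LLM can report a higher adherence percentage than the expert baseline.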


Figures:
Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/399f/12162794/b6f997a128ab/10916_2025_2212_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/399f/12162794/bb5ac01c4348/10916_2025_2212_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/399f/12162794/ced5e7ddd0e2/10916_2025_2212_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/399f/12162794/e2ce0723d9fc/10916_2025_2212_Fig4_HTML.jpg

Similar Articles

1
Benchmarking Human-AI collaboration for common evidence appraisal tools.
J Clin Epidemiol. 2024 Nov;175:111533. doi: 10.1016/j.jclinepi.2024.111533. Epub 2024 Sep 12.
2
Large language models for conducting systematic reviews: on the rise, but not yet ready for use-a scoping review.
J Clin Epidemiol. 2025 May;181:111746. doi: 10.1016/j.jclinepi.2025.111746. Epub 2025 Feb 26.
3
Evaluations of the uptake and impact of the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) Statement and extensions: a scoping review.
Syst Rev. 2017 Dec 19;6(1):263. doi: 10.1186/s13643-017-0663-8.
4
Quality of reporting of systematic reviews and meta-analyses in emergency medicine based on the PRISMA statement.
BMC Emerg Med. 2019 Feb 11;19(1):19. doi: 10.1186/s12873-019-0233-6.
5
Endorsement of PRISMA statement and quality of systematic reviews and meta-analyses published in nursing journals: a cross-sectional study.
BMJ Open. 2017 Feb 7;7(2):e013905. doi: 10.1136/bmjopen-2016-013905.
6
Impact of large language model (ChatGPT) in healthcare: an umbrella review and evidence synthesis.
J Biomed Sci. 2025 May 7;32(1):45. doi: 10.1186/s12929-025-01131-z.
7
Clinical Epidemiology in China series. Paper 3: The methodological and reporting quality of systematic reviews and meta-analyses published by China's researchers in English-language is higher than those published in Chinese-language.
J Clin Epidemiol. 2021 Dec;140:178-188. doi: 10.1016/j.jclinepi.2021.08.014. Epub 2021 Aug 18.
8
Systematic review adherence to methodological or reporting quality.
Syst Rev. 2017 Jul 19;6(1):131. doi: 10.1186/s13643-017-0527-2.
9
Methodological and reporting quality assessment of systematic reviews and meta-analyses in the association between sleep duration and hypertension.
Syst Rev. 2024 Aug 6;13(1):211. doi: 10.1186/s13643-024-02622-0.

Cited By

1
Dissecting HealthBench: Disease Spectrum, Clinical Diversity, and Data Insights from Multi-Turn Clinical AI Evaluation Benchmark.
J Med Syst. 2025 Jul 28;49(1):100. doi: 10.1007/s10916-025-02232-w.

References

1
Innovative solutions are needed to overcome implementation barriers to using reporting guidelines.
BMJ. 2025 Apr 14;389:r718. doi: 10.1136/bmj.r718.
2
Fine-Tuning Large Language Models for Specialized Use Cases.
Mayo Clin Proc Digit Health. 2024 Nov 29;3(1):100184. doi: 10.1016/j.mcpdig.2024.11.005. eCollection 2025 Mar.
3
GPT for RCTs? Using AI to determine adherence to clinical trial reporting guidelines.
BMJ Open. 2025 Mar 18;15(3):e088735. doi: 10.1136/bmjopen-2024-088735.
4
Large language models for conducting systematic reviews: on the rise, but not yet ready for use-a scoping review.
J Clin Epidemiol. 2025 May;181:111746. doi: 10.1016/j.jclinepi.2025.111746. Epub 2025 Feb 26.
5
The MI-CLAIM-GEN checklist for generative artificial intelligence in health.
Nat Med. 2025 May;31(5):1394-1398. doi: 10.1038/s41591-024-03470-0.
6
A platform for the biomedical application of large language models.
Nat Biotechnol. 2025 Feb;43(2):166-169. doi: 10.1038/s41587-024-02534-3.
7
From promise to practice: challenges and pitfalls in the evaluation of large language models for data extraction in evidence synthesis.
BMJ Evid Based Med. 2024 Dec 20. doi: 10.1136/bmjebm-2024-113199.
8
Evaluating the Performance of ChatGPT-4o in Risk of Bias Assessments.
J Evid Based Med. 2024 Dec;17(4):700-702. doi: 10.1111/jebm.12662. Epub 2024 Dec 15.
9
Minimum Reporting Items for Clear Evaluation of Accuracy Reports of Large Language Models in Healthcare (MI-CLEAR-LLM).
Korean J Radiol. 2024 Oct;25(10):865-868. doi: 10.3348/kjr.2024.0843.
10
Reporting quality of meta-analyses in acupuncture: Investigating adherence to the PRISMA statement.
Medicine (Baltimore). 2024 Sep 27;103(39):e39933. doi: 10.1097/MD.0000000000039933.