Suppr 超能文献

Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages.

Affiliations

Trinity Centre for Global Health, Trinity College Dublin, Dublin, Ireland.

School of Psychology, Trinity College Dublin, Dublin, Ireland.

Publication

Res Synth Methods. 2024 Jul;15(4):616-626. doi: 10.1002/jrsm.1715. Epub 2024 Mar 14.

DOI: 10.1002/jrsm.1715
Abstract

Systematic reviews are vital for guiding practice, research and policy, although they are often slow and labour-intensive. Large language models (LLMs) could speed up and automate systematic reviews, but their performance in such tasks has yet to be comprehensively evaluated against humans, and no study has tested Generative Pre-Trained Transformer (GPT)-4, the biggest LLM so far. This pre-registered study uses a "human-out-of-the-loop" approach to evaluate GPT-4's capability in title/abstract screening, full-text review and data extraction across various literature types and languages. Although GPT-4 had accuracy on par with human performance in some tasks, results were skewed by chance agreement and dataset imbalance. Adjusting for these caused performance scores to drop across all stages: for data extraction, performance was moderate, and for screening, it ranged from none in highly balanced literature datasets (1:1) to moderate in those datasets where the ratio of inclusion to exclusion in studies was imbalanced (1:3). When screening full-text literature using highly reliable prompts, GPT-4's performance was more robust, reaching "human-like" levels. Although our findings indicate that, currently, substantial caution should be exercised if LLMs are being used to conduct systematic reviews, they also offer preliminary evidence that, for certain review tasks delivered under specific conditions, LLMs can rival human performance.
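The abstract notes that raw accuracy was "skewed by chance agreement and dataset imbalance," and that adjusting for these lowered GPT-4's scores, especially on balanced (1:1) screening sets. A standard chance-corrected agreement metric is Cohen's kappa. The sketch below is illustrative only (not the paper's analysis code, and the data are made up): it shows why a screener that simply excludes every record looks accurate on an imbalanced 1:3 include-to-exclude dataset yet has zero agreement beyond chance.

```python
def accuracy(y_true, y_pred):
    """Raw proportion of matching screening decisions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    labels = set(y_true) | set(y_pred)
    # Expected agreement if both raters decided independently
    # at their own marginal base rates.
    p_e = sum((y_true.count(l) / n) * (y_pred.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1:3 imbalanced screening set: 1 = include, 0 = exclude.
human = [1, 0, 0, 0, 1, 0, 0, 0]
model = [0, 0, 0, 0, 0, 0, 0, 0]  # a "screener" that excludes everything

print(accuracy(human, model))     # 0.75 -- looks respectable
print(cohen_kappa(human, model))  # 0.0  -- no agreement beyond chance
```

This is why the authors report performance dropping once chance agreement and class imbalance are accounted for: on imbalanced data, raw accuracy rewards always predicting the majority class.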


Similar Articles

1
Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages.
Res Synth Methods. 2024 Jul;15(4):616-626. doi: 10.1002/jrsm.1715. Epub 2024 Mar 14.
2
Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study.
J Med Internet Res. 2024 Apr 17;26:e56655. doi: 10.2196/56655.
3
Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study.
J Med Internet Res. 2024 Jan 12;26:e48996. doi: 10.2196/48996.
4
Implementation and evaluation of an additional GPT-4-based reviewer in PRISMA-based medical systematic literature reviews.
Int J Med Inform. 2024 Sep;189:105531. doi: 10.1016/j.ijmedinf.2024.105531. Epub 2024 Jun 26.
5
An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study.
JMIR Med Inform. 2024 Apr 8;12:e55318. doi: 10.2196/55318.
6
A Language Model-Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study.
JMIR Med Educ. 2024 Aug 16;10:e59213. doi: 10.2196/59213.
7
Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room.
J Am Med Inform Assoc. 2024 Sep 1;31(9):1921-1928. doi: 10.1093/jamia/ocae103.
8
Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis.
J Med Internet Res. 2024 Jul 25;26:e60807. doi: 10.2196/60807.
9
The Rapid Development of Artificial Intelligence: GPT-4's Performance on Orthopedic Surgery Board Questions.
Orthopedics. 2024 Mar-Apr;47(2):e85-e89. doi: 10.3928/01477447-20230922-05. Epub 2023 Sep 27.
10
Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis.
J Med Internet Res. 2024 May 22;26:e53164. doi: 10.2196/53164.

Cited By

1
Is Large Language Model-Assisted Citation Screening Feasible in a Scoping Review on Nonpharmacological Interventions for Delirium in Patients With Cancer?
Cureus. 2025 Aug 13;17(8):e90026. doi: 10.7759/cureus.90026. eCollection 2025 Aug.
2
Critical Assessment of Large Language Models' (ChatGPT) Performance in Data Extraction for Systematic Reviews: Exploratory Study.
JMIR AI. 2025 Sep 11;4:e68097. doi: 10.2196/68097.
3
How Well Do ChatGPT and Claude Perform in Study Selection for Systematic Review in Obstetrics.
J Med Syst. 2025 Sep 4;49(1):110. doi: 10.1007/s10916-025-02246-4.
4
Artificial intelligence for the science of evidence synthesis: how good are AI-powered tools for automatic literature screening?
BMC Med Res Methodol. 2025 Aug 25;25(1):199. doi: 10.1186/s12874-025-02644-9.
5
Assessing the practicality of using freely available AI-based GPT tools for coach learning and athlete development.
Front Sports Act Living. 2025 Jul 29;7:1627685. doi: 10.3389/fspor.2025.1627685. eCollection 2025.
6
Evaluating a Customized Version of ChatGPT for Systematic Review Data Extraction in Health Research: Development and Usability Study.
JMIR Form Res. 2025 Aug 11;9:e68666. doi: 10.2196/68666.
7
A foundation model for human-AI collaboration in medical literature mining.
ArXiv. 2025 Jan 27:arXiv:2501.16255v1.
8
Using Artificial Intelligence Tools as Second Reviewers for Data Extraction in Systematic Reviews: A Performance Comparison of Two AI Tools Against Human Reviewers.
Cochrane Evid Synth Methods. 2025 Jul 14;3(4):e70036. doi: 10.1002/cesm.70036. eCollection 2025 Jul.
9
Using GPT-4 for Title and Abstract Screening in a Literature Review of Public Policies: A Feasibility Study.
Cochrane Evid Synth Methods. 2025 May 22;3(3):e70031. doi: 10.1002/cesm.70031. eCollection 2025 May.
10
Over 50 years of research on social disparities in pain and pain treatment: a scoping review of reviews.
Pain. 2025 Jun 24. doi: 10.1097/j.pain.0000000000003676.