

Using Artificial Intelligence Tools as Second Reviewers for Data Extraction in Systematic Reviews: A Performance Comparison of Two AI Tools Against Human Reviewers.

Authors

Helms Andersen T, Marcussen T M, Termannsen A D, Lawaetz T W H, Nørgaard O

Affiliation

Copenhagen University Hospital - Steno Diabetes Center Copenhagen, Herlev, Denmark.

Publication

Cochrane Evid Synth Methods. 2025 Jul 14;3(4):e70036. doi: 10.1002/cesm.70036. eCollection 2025 Jul.

DOI: 10.1002/cesm.70036
PMID: 40661122
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12257877/
Abstract

BACKGROUND

Systematic reviews are essential but time-consuming and expensive. Large language models (LLMs) and artificial intelligence (AI) tools could potentially automate data extraction, but no comprehensive workflow has been tested for different review types.

OBJECTIVE

To evaluate Elicit's and ChatGPT's abilities to extract data from journal articles as a replacement for one of two human data extractors in systematic reviews.

METHODS

Human-extracted data from three systematic reviews (30 articles in total) was compared to data extracted by Elicit and ChatGPT. The AI tools extracted population characteristics, study design, and review-specific variables. Performance metrics were calculated against human double-extracted data as the gold standard, followed by a detailed error analysis.

RESULTS

Precision, recall, and F1-score were all 92% for Elicit, and 91%, 89%, and 90%, respectively, for ChatGPT. Recall was highest for study design (Elicit: 100%; ChatGPT: 90%) and population characteristics (Elicit: 100%; ChatGPT: 97%), while review-specific variables achieved 77% in Elicit and 80% in ChatGPT. Elicit had four instances of confabulation while ChatGPT had three. There was no significant difference between the two AI tools' performance (recall difference: 3.3 percentage points, 95% CI: -5.2% to 11.9%, p = 0.445).
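The reported metrics follow the standard definitions of precision, recall, and F1-score computed against the human double-extracted gold standard. A minimal sketch, using hypothetical true-positive/false-positive/false-negative counts for illustration (not the study's raw data):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions used when scoring extracted data points
    against a gold standard of human double-extracted data."""
    precision = tp / (tp + fp)          # correct extractions / all extractions made
    recall = tp / (tp + fn)             # correct extractions / all extractable items
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts chosen to reproduce the 92% figure reported for Elicit:
p, r, f1 = precision_recall_f1(tp=92, fp=8, fn=8)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.92 0.92 0.92
```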

CONCLUSION

AI tools demonstrated high and similar performance in data extraction compared to human reviewers, particularly for standardized variables. Error analysis revealed confabulations in 4% of data points. We propose adopting AI-assisted extraction to replace the second human extractor, with the second human instead focusing on reconciling discrepancies between AI and the primary human extractor.
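The proposed workflow, in which the second human reconciles discrepancies between the AI and the primary human extractor rather than extracting independently, can be sketched as a simple field-by-field comparison. The function name and record fields below are hypothetical, for illustration only:

```python
def flag_discrepancies(human: dict, ai: dict) -> dict:
    """Return the fields on which the primary human extractor and the
    AI tool disagree, for a second human reviewer to reconcile."""
    fields = set(human) | set(ai)
    return {
        field: (human.get(field), ai.get(field))
        for field in fields
        if human.get(field) != ai.get(field)
    }

# Hypothetical extraction records for one included study:
human_record = {"n": 120, "design": "RCT", "mean_age": 54.2}
ai_record = {"n": 120, "design": "RCT", "mean_age": 52.0}
print(flag_discrepancies(human_record, ai_record))  # {'mean_age': (54.2, 52.0)}
```

Only the flagged fields need the second reviewer's attention, which is where the time saving over full double extraction would come from.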


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f9f/12257877/c7c58fc4ed33/CESM-3-e70036-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f9f/12257877/e14d92c727ec/CESM-3-e70036-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f9f/12257877/87b59d3d19aa/CESM-3-e70036-g003.jpg

