

Using Artificial Intelligence Tools as Second Reviewers for Data Extraction in Systematic Reviews: A Performance Comparison of Two AI Tools Against Human Reviewers.

Authors

Helms Andersen T, Marcussen T M, Termannsen A D, Lawaetz T W H, Nørgaard O

Affiliation

Copenhagen University Hospital - Steno Diabetes Center Copenhagen, Herlev, Denmark.

Publication

Cochrane Evid Synth Methods. 2025 Jul 14;3(4):e70036. doi: 10.1002/cesm.70036. eCollection 2025 Jul.

DOI: 10.1002/cesm.70036
PMID: 40661122
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12257877/
Abstract

BACKGROUND

Systematic reviews are essential but time-consuming and expensive. Large language models (LLMs) and artificial intelligence (AI) tools could potentially automate data extraction, but no comprehensive workflow has been tested for different review types.

OBJECTIVE

To evaluate Elicit's and ChatGPT's abilities to extract data from journal articles as a replacement for one of two human data extractors in systematic reviews.

METHODS

Human-extracted data from three systematic reviews (30 articles in total) was compared to data extracted by Elicit and ChatGPT. The AI tools extracted population characteristics, study design, and review-specific variables. Performance metrics were calculated against human double-extracted data as the gold standard, followed by a detailed error analysis.

RESULTS

Precision, recall, and F1-score were all 92% for Elicit, and 91%, 89%, and 90%, respectively, for ChatGPT. Recall was highest for study design (Elicit: 100%; ChatGPT: 90%) and population characteristics (Elicit: 100%; ChatGPT: 97%), while review-specific variables achieved 77% in Elicit and 80% in ChatGPT. Elicit had four instances of confabulation while ChatGPT had three. There was no significant difference between the two AI tools' performance (recall difference: 3.3 percentage points, 95% CI: -5.2% to 11.9%, p = 0.445).
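The reported metrics follow the standard definitions of precision, recall, and F1-score computed against the human double-extracted gold standard. A minimal sketch, using hypothetical true-positive/false-positive/false-negative counts for illustration (not the study's raw data):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions used when scoring extracted data points
    against a gold standard of human double-extracted data."""
    precision = tp / (tp + fp)          # correct extractions / all extractions made
    recall = tp / (tp + fn)             # correct extractions / all extractable items
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts chosen to reproduce the 92% figure reported for Elicit:
p, r, f1 = precision_recall_f1(tp=92, fp=8, fn=8)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.92 0.92 0.92
```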

CONCLUSION

AI tools demonstrated high and similar performance in data extraction compared to human reviewers, particularly for standardized variables. Error analysis revealed confabulations in 4% of data points. We propose adopting AI-assisted extraction to replace the second human extractor, with the second human instead focusing on reconciling discrepancies between AI and the primary human extractor.
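The proposed workflow, in which the second human reconciles discrepancies between the AI and the primary human extractor rather than extracting independently, can be sketched as a simple field-by-field comparison. The function name and record fields below are hypothetical, for illustration only:

```python
def flag_discrepancies(human: dict, ai: dict) -> dict:
    """Return the fields on which the primary human extractor and the
    AI tool disagree, for a second human reviewer to reconcile."""
    fields = set(human) | set(ai)
    return {
        field: (human.get(field), ai.get(field))
        for field in fields
        if human.get(field) != ai.get(field)
    }

# Hypothetical extraction records for one included study:
human_record = {"n": 120, "design": "RCT", "mean_age": 54.2}
ai_record = {"n": 120, "design": "RCT", "mean_age": 52.0}
print(flag_discrepancies(human_record, ai_record))  # {'mean_age': (54.2, 52.0)}
```

Only the flagged fields need the second reviewer's attention, which is where the time saving over full double extraction would come from.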


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f9f/12257877/c7c58fc4ed33/CESM-3-e70036-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f9f/12257877/e14d92c727ec/CESM-3-e70036-g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/7f9f/12257877/87b59d3d19aa/CESM-3-e70036-g003.jpg

