Helms Andersen T, Marcussen TM, Termannsen AD, Lawaetz TWH, Nørgaard O.
Copenhagen University Hospital-Steno Diabetes Center Copenhagen, Herlev, Denmark.
Cochrane Evid Synth Methods. 2025 Jul 14;3(4):e70036. doi: 10.1002/cesm.70036. eCollection 2025 Jul.
Systematic reviews are essential but time-consuming and expensive. Large language models (LLMs) and other artificial intelligence (AI) tools could automate data extraction, but no comprehensive workflow has been tested across different review types.
To evaluate Elicit's and ChatGPT's abilities to extract data from journal articles as a replacement for one of two human data extractors in systematic reviews.
Human-extracted data from three systematic reviews (30 articles in total) were compared with data extracted by Elicit and ChatGPT. The AI tools extracted population characteristics, study design, and review-specific variables. Performance metrics were calculated against the human double-extracted data as the gold standard, followed by a detailed error analysis.
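For reference, the abstract does not spell out the metric definitions; assuming the standard per-data-point formulation (true positives TP = extracted values matching the gold standard, false positives FP = extracted values that do not match, false negatives FN = gold-standard values a tool missed), the metrics are:

\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]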
Precision, recall, and F1-score were all 92% for Elicit, and 91%, 89%, and 90%, respectively, for ChatGPT. Recall was highest for study design (Elicit: 100%; ChatGPT: 90%) and population characteristics (Elicit: 100%; ChatGPT: 97%), while review-specific variables achieved 77% recall in Elicit and 80% in ChatGPT. Elicit produced four instances of confabulation; ChatGPT produced three. There was no significant difference between the two AI tools' performance (recall difference: 3.3 percentage points, 95% CI: -5.2 to 11.9, p = 0.445).
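As an arithmetic check (ours, not the article's), ChatGPT's reported F1 is consistent with its precision and recall:

\[
F_1 = \frac{2 \times 0.91 \times 0.89}{0.91 + 0.89} \approx 0.90
\]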
Both AI tools demonstrated high, comparable performance in data extraction relative to human reviewers, particularly for standardized variables. Error analysis revealed confabulations in 4% of data points. We propose adopting AI-assisted extraction to replace the second human extractor, with that reviewer instead focusing on reconciling discrepancies between the AI and the primary human extractor.