
Collaborative large language models for automated data extraction in living systematic reviews.

Author Information

Khan Muhammad Ali, Ayub Umair, Naqvi Syed Arsalan Ahmed, Khakwani Kaneez Zahra Rubab, Sipra Zaryab Bin Riaz, Raina Ammad, Zhou Sihan, He Huan, Saeidi Amir, Hasan Bashar, Rumble Robert Bryan, Bitterman Danielle S, Warner Jeremy L, Zou Jia, Tevaarwerk Amye J, Leventakos Konstantinos, Kehl Kenneth L, Palmer Jeanne M, Murad Mohammad Hassan, Baral Chitta, Riaz Irbaz Bin

Affiliations

Department of Medicine, Mayo Clinic, Phoenix, AZ, 85054, United States.

Department of Medicine, University of Arizona, Tucson, AZ, 85721, United States.

Publication Information

J Am Med Inform Assoc. 2025 Apr 1;32(4):638-647. doi: 10.1093/jamia/ocae325.

Abstract

OBJECTIVE

Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world 2-reviewer process.

MATERIALS AND METHODS

A dataset of 10 trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n = 5) and held-out test sets (n = 17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the 2 LLMs were considered concordant if they were the same for a given variable. The discordant responses from each LLM were provided to the other LLM for cross-critique. Accuracy, ie, the total number of correct responses divided by the total number of responses, was computed to assess performance.
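Below is a minimal Python sketch of this 2-reviewer workflow, assuming simple string equality as the concordance check. The llm_a/llm_b and critic_a/critic_b callables stand in for GPT-4-turbo and Claude-3-Opus prompt wrappers; the names and the critique interface are illustrative assumptions, not the authors' implementation.

from typing import Callable, Optional

# (article_text, variable) -> extracted answer
Extractor = Callable[[str, str], str]
# (article_text, variable, own_answer, other_answer) -> possibly revised answer
Critic = Callable[[str, str, str, str], str]

def dual_review(text: str, variables: list[str],
                llm_a: Extractor, llm_b: Extractor,
                critic_a: Critic, critic_b: Critic) -> dict[str, Optional[str]]:
    """Extract each variable with both models; cross-critique disagreements."""
    responses: dict[str, Optional[str]] = {}
    for var in variables:
        a, b = llm_a(text, var), llm_b(text, var)
        if a == b:                      # concordant: accept the shared answer
            responses[var] = a
            continue
        # Discordant: each model sees the other's answer and may revise its own.
        a_revised = critic_a(text, var, a, b)
        b_revised = critic_b(text, var, b, a)
        # Still discordant after cross-critique -> flag for human adjudication.
        responses[var] = a_revised if a_revised == b_revised else None
    return responses

def accuracy(responses: dict[str, Optional[str]], gold: dict[str, str]) -> float:
    """Correct responses divided by total responses, as defined above."""
    return sum(responses.get(v) == ans for v, ans in gold.items()) / len(gold)

In practice the concordance check would need answer normalization (case, units, numeric rounding) rather than raw string equality before comparison.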

RESULTS

In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, increasing accuracy to 0.76.
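A quick arithmetic check of these counts, assuming the 5/17 split refers to publications and all 23 variables were extracted from each:

  5 × 23 = 115 prompt development responses; 110 / 115 ≈ 0.96 concordant
  17 × 23 = 391 test responses; 342 / 391 ≈ 0.87 concordant
  391 − 342 = 49 discordant; 25 / 49 ≈ 0.51 resolved by cross-critique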

DISCUSSION

Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy.

CONCLUSION

Large language models, when simulated in a collaborative, 2-reviewer workflow, can extract data with reasonable performance, enabling truly "living" systematic reviews.


