Khan Muhammad Ali, Ayub Umair, Naqvi Syed Arsalan Ahmed, Khakwani Kaneez Zahra Rubab, Sipra Zaryab Bin Riaz, Raina Ammad, Zou Sihan, He Huan, Hossein Seyyed Amir, Hasan Bashar, Rumble R Bryan, Bitterman Danielle S, Warner Jeremy L, Zou Jia, Tevaarwerk Amye J, Leventakos Konstantinos, Kehl Kenneth L, Palmer Jeanne M, Murad M Hassan, Baral Chitta, Riaz Irbaz Bin
Department of Medicine, Mayo Clinic, Phoenix, United States of America.
Department of Medicine, University of Arizona, Tucson, United States of America.
medRxiv. 2024 Sep 23:2024.09.20.24314108. doi: 10.1101/2024.09.20.24314108.
Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow that leverages large language models (LLMs) and mimics the real-world two-reviewer process.
A dataset of 10 clinical trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into a prompt development set (n=5) and a held-out test set (n=17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the two LLMs were compared for concordance. In instances of discordance, the original response from each LLM was provided to the other LLM for cross-critique. Evaluation metrics, including accuracy, were used to assess performance against a manually curated gold standard.
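The two-reviewer loop with cross-critique can be pictured with a minimal Python sketch. The prompt wording, model identifiers, and the exact-string concordance check below are illustrative assumptions, not the prompts or adjudication rules used in the study:

```python
"""Minimal sketch of a two-reviewer LLM extraction loop with cross-critique.

Assumed, not taken from the paper: prompt text, model IDs, and the simple
case-insensitive string match used here to judge concordance.
"""
from openai import OpenAI
import anthropic

openai_client = OpenAI()               # reads OPENAI_API_KEY from the environment
claude_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

def ask_gpt(prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

def ask_claude(prompt: str) -> str:
    resp = claude_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text.strip()

def extract_variable(article_text: str, variable: str) -> dict:
    """Run both LLM 'reviewers'; cross-critique only when they disagree."""
    prompt = (
        f"From the trial report below, extract the value of '{variable}'. "
        f"Answer with the value only.\n\n{article_text}"
    )
    gpt_answer = ask_gpt(prompt)
    claude_answer = ask_claude(prompt)

    if gpt_answer.lower() == claude_answer.lower():  # concordant on first pass
        return {"variable": variable, "value": gpt_answer, "concordant": True}

    # Discordant: show each model the other's answer and ask it to reconsider.
    def critique_prompt(own: str, other: str) -> str:
        return (
            f"{prompt}\n\nYou previously answered '{own}'. Another reviewer "
            f"answered '{other}'. Critique both answers and state your final value only."
        )

    gpt_final = ask_gpt(critique_prompt(gpt_answer, claude_answer))
    claude_final = ask_claude(critique_prompt(claude_answer, gpt_answer))
    concordant = gpt_final.lower() == claude_final.lower()

    return {
        "variable": variable,
        "value": gpt_final if concordant else None,  # unresolved cases go to a human
        "concordant": concordant,
    }
```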
In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, with an increase in accuracy to 0.76.
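The denominators behind these counts are not stated in the abstract; assuming each of the 23 variables was extracted once per publication, the reported figures are internally consistent:

$$23 \times 5 = 115 \ \text{responses},\qquad 110/115 \approx 96\% \quad \text{(prompt development)}$$
$$23 \times 17 = 391 \ \text{responses},\qquad 342/391 \approx 87\%,\qquad 391 - 342 = 49 \ \text{discordant} \quad \text{(test)}$$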
Concordant responses from the two LLMs are likely to be accurate. When responses are discordant, cross-critique can further increase accuracy.
Large language models operating in a collaborative, two-reviewer workflow can extract data with reasonable performance, enabling truly 'living' systematic reviews.