

Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study.

Affiliations

Cumming School of Medicine, University of Calgary, Calgary, AB, Canada.

Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada.

Publication Information

J Med Internet Res. 2024 Jan 12;26:e48996. doi: 10.2196/48996.

Abstract

BACKGROUND

The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources.

OBJECTIVE

This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets, and to compare their performance against ground truth labeling by 2 independent human reviewers.

METHODS

We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts.
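The screening step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' released script: the prompt wording, the INCLUDE/EXCLUDE response convention, and the model name are assumptions; only the general pattern (natural-language criteria plus one title/abstract per API call) comes from the abstract.

```python
def build_messages(criteria: str, title: str, abstract: str) -> list:
    """Assemble a chat prompt asking the model to screen one paper
    against natural-language inclusion criteria (wording is assumed)."""
    return [
        {"role": "system",
         "content": "You screen papers for a systematic review. "
                    "Answer with exactly INCLUDE or EXCLUDE."},
        {"role": "user",
         "content": f"Criteria: {criteria}\n\n"
                    f"Title: {title}\n\nAbstract: {abstract}"},
    ]

def parse_decision(reply: str) -> bool:
    """Map the model's free-text reply to a boolean include label."""
    return reply.strip().upper().startswith("INCLUDE")

# The actual API call would look like this (requires the `openai`
# package and an API key, so it is left commented out here):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4",
#     messages=build_messages(criteria, title, abstract),
# )
# include = parse_decision(resp.choices[0].message.content)
```

Keeping prompt construction and reply parsing as pure functions makes the workflow easy to test offline and to rerun over a corpus of thousands of records.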

RESULTS

Our results show an accuracy of 0.91, a macro F-score of 0.60, a sensitivity of excluded papers of 0.91, and a sensitivity of included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence and bias-adjusted κ between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications.
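The reported metrics can all be derived from a binary confusion matrix over include/exclude decisions. A minimal sketch, using illustrative counts rather than the study's data; the prevalence- and bias-adjusted kappa (PABAK) reduces to 2·(observed agreement) − 1 for two raters and two categories:

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute screening metrics from a binary confusion matrix,
    where the positive class is "include" (tp = both say include,
    tn = both say exclude)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sens_included = tp / (tp + fn)   # recall on included papers
    sens_excluded = tn / (tn + fp)   # recall on excluded papers
    prec_inc = tp / (tp + fp)
    prec_exc = tn / (tn + fn)
    f1_inc = 2 * prec_inc * sens_included / (prec_inc + sens_included)
    f1_exc = 2 * prec_exc * sens_excluded / (prec_exc + sens_excluded)
    macro_f = (f1_inc + f1_exc) / 2  # unweighted mean of per-class F1
    pabak = 2 * accuracy - 1         # prevalence/bias-adjusted kappa
    return {"accuracy": accuracy, "macro_f": macro_f,
            "sens_included": sens_included,
            "sens_excluded": sens_excluded, "pabak": pabak}
```

The macro F-score weights both classes equally, which is why it can sit well below accuracy on screening corpora where excluded papers vastly outnumber included ones.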

CONCLUSIONS

Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.


Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1b37/10818236/9e2f72409d22/jmir_v26i1e48996_fig1.jpg
