Lieberum Judith-Lisa, Toews Markus, Metzendorf Maria-Inti, Heilmeyer Felix, Siemens Waldemar, Haverkamp Christian, Böhringer Daniel, Meerpohl Joerg J, Eisele-Metzger Angelika
Eye Clinic, Medical Center - University of Freiburg/Medical Faculty - University of Freiburg, Freiburg, Germany.
Institute for Evidence in Medicine, Medical Center - University of Freiburg/Medical Faculty - University of Freiburg, Freiburg, Germany.
J Clin Epidemiol. 2025 May;181:111746. doi: 10.1016/j.jclinepi.2025.111746. Epub 2025 Feb 26.
Machine learning promises versatile support for the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct have attracted attention. We aimed to provide an overview of LLM applications in SR conduct in health research.
We systematically searched MEDLINE, Web of Science, IEEE Xplore, ACM Digital Library, Europe PMC (preprints), and Google Scholar, and conducted an additional hand search (last search: February 26, 2024). We included scientific articles in English or German published from April 2021 onwards, building on the results of a mapping review that had not yet identified LLM applications to support SRs. Two reviewers independently screened studies for eligibility; after piloting, one reviewer extracted data, which a second reviewer checked.
Our database search yielded 8054 hits, and we identified an additional 33 articles through the hand search. We finally included 37 articles on LLM support. The LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n = 15, 41%), study selection (n = 14, 38%), and data extraction (n = 11, 30%). The most frequently used LLM was the Generative Pretrained Transformer (GPT) (n = 33, 89%). Validation studies predominated (n = 21, 57%). Authors evaluated LLM use as promising in about half of the studies (n = 20, 54%), as neutral in one-quarter (n = 9, 24%), and as nonpromising in one-fifth (n = 8, 22%).
Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.
Systematic reviews are a crucial tool in health research where experts carefully collect and analyze all available evidence on a specific research question. Creating these reviews is typically time- and resource-intensive, often taking months or even years to complete, as researchers must thoroughly search, evaluate, and synthesize an immense number of scientific studies. For the present article, we conducted a review to understand how new artificial intelligence (AI) tools, specifically large language models (LLMs) like the Generative Pretrained Transformer (GPT), can be used to help create systematic reviews in health research. We searched multiple scientific databases and ultimately included 37 relevant articles. We found that LLMs have been tested to help with various parts of the systematic review process, particularly in 3 main areas: searching scientific literature (41% of studies), selecting relevant studies (38%), and extracting important information from these studies (30%). GPT was the most commonly used LLM, appearing in 89% of the studies. Most of the research (57%) focused on testing whether these AI tools actually work as intended in the context of systematic review production. The results were mixed: about half of the studies found LLMs promising, a quarter were neutral, and one-fifth found them not promising. While LLMs show potential for making the systematic review process more efficient, there is still a lack of fully tested and validated applications. However, the growing number of studies in this field suggests that these AI tools are becoming increasingly important in creating systematic reviews.
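To illustrate what such LLM support can look like in practice, the following is a minimal sketch of LLM-assisted title/abstract screening, one of the SR steps most often supported in the included studies. It assumes the OpenAI Python client; the model name, eligibility criteria, and example record are hypothetical and are not drawn from any of the included articles.

```python
# A minimal, illustrative sketch of LLM-assisted title/abstract screening.
# All specifics below (model, criteria, record) are hypothetical examples.
from openai import OpenAI  # assumes the OpenAI Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ELIGIBILITY_CRITERIA = """\
Include: randomized controlled trials in adults with type 2 diabetes
evaluating any digital health intervention.
Exclude: animal studies, protocols, conference abstracts."""

def screen_record(title: str, abstract: str) -> str:
    """Ask the model for an include/exclude/unclear judgement on one record."""
    prompt = (
        f"Eligibility criteria:\n{ELIGIBILITY_CRITERIA}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE, EXCLUDE, or UNCLEAR."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; any chat-capable model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output is preferable for screening
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    # A single fabricated record used only to demonstrate the call.
    decision = screen_record(
        title="A smartphone app for glycaemic control: a randomized trial",
        abstract="We randomized 200 adults with type 2 diabetes to app-based "
                 "coaching or usual care and measured HbA1c at 6 months.",
    )
    print(decision)  # e.g. INCLUDE; a human reviewer would verify the decision
```

Requesting a single-word verdict at temperature 0 keeps the output easy to compare against human reviewer decisions; in line with the validation focus of most included studies, such automated judgements would be checked rather than used autonomously.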