Department of Methodology and Statistics, Faculty of Social and Behavioral Sciences, Utrecht University, Utrecht, the Netherlands.
Department of Research and Data Management Services, Information Technology Services, Utrecht University, Utrecht, the Netherlands.
Syst Rev. 2023 Jun 20;12(1):100. doi: 10.1186/s13643-023-02257-7.
Conducting a systematic review demands a significant amount of effort in screening titles and abstracts. To accelerate this process, various tools that utilize active learning have been proposed. These tools allow the reviewer to interact with machine learning software to identify relevant publications as early as possible. The goal of this study is to gain a comprehensive understanding of active learning models for reducing the workload in systematic reviews through a simulation study.
The simulation study mimics the process of a human reviewer screening records while interacting with an active learning model. Different active learning models were compared based on four classification techniques (naive Bayes, logistic regression, support vector machines, and random forest) and two feature extraction strategies (TF-IDF and doc2vec). The performance of the models was compared for six systematic review datasets from different research areas. The evaluation of the models was based on the Work Saved over Sampling (WSS) and recall. Additionally, this study introduces two new statistics, Time to Discovery (TD) and Average Time to Discovery (ATD).
The models reduced the number of publications needed to screen by 63.9 to 91.7% while still finding 95% of all relevant records (WSS@95). Recall, defined as the proportion of relevant records found after screening 10% of all records, ranged from 53.6 to 99.8%. The ATD values, which indicate the average proportion of labeling decisions the researcher needs to make to detect a relevant record, ranged from 1.4 to 11.7%. The ATD values displayed a ranking across the simulations similar to that of the recall and WSS values.
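The three metrics can be computed directly from the order in which a model presents records to the reviewer. The sketch below is illustrative only (the function names and the WSS formulation are assumptions, not taken from the study's code); it follows one common definition of WSS@95 and the ATD definition given above, where each Time to Discovery is the proportion of labeling decisions made when a relevant record is found.

```python
import math

# `ranking` is a list of 0/1 relevance labels in the order the active
# learning model presented the records for screening (hypothetical input).

def wss(ranking, recall_level=0.95):
    """Work Saved over Sampling at the given recall level (common formulation)."""
    n = len(ranking)
    target = math.ceil(recall_level * sum(ranking))
    found = 0
    for i, label in enumerate(ranking, start=1):
        found += label
        if found >= target:
            # proportion of records left unscreened, corrected for the
            # fraction that random screening would leave at this recall
            return (n - i) / n - (1 - recall_level)

def recall_at(ranking, proportion=0.10):
    """Proportion of relevant records found after screening a fixed share."""
    n_screened = int(len(ranking) * proportion)
    return sum(ranking[:n_screened]) / sum(ranking)

def atd(ranking):
    """Average Time to Discovery: mean screening proportion at which
    each relevant record was found."""
    n = len(ranking)
    tds = [(i + 1) / n for i, label in enumerate(ranking) if label == 1]
    return sum(tds) / len(tds)
```

For example, with `ranking = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]` (three relevant records among ten), 95% recall is reached after four decisions, so `wss(ranking)` gives 0.55, and `atd(ranking)` averages the discovery proportions 0.1, 0.2, and 0.4.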
Active learning models for screening prioritization demonstrate significant potential for reducing the workload in systematic reviews. The Naive Bayes + TF-IDF model yielded the best results overall. The Average Time to Discovery (ATD) measures performance of active learning models throughout the entire screening process without the need for an arbitrary cut-off point. This makes the ATD a promising metric for comparing the performance of different models across different datasets.