Lange Toni, Schwarzer Guido, Datzmann Thomas, Binder Harald
Center for Evidence-based Healthcare, University Hospital Carl Gustav Carus and Faculty of Medicine Carl Gustav Carus, Technische Universität Dresden, Germany.
Faculty of Medicine and Medical Center, Institute of Medical Biometry and Statistics, University of Freiburg, Freiburg, Germany.
Res Synth Methods. 2021 Jul;12(4):506-515. doi: 10.1002/jrsm.1486. Epub 2021 Mar 28.
Updating systematic reviews is often a time-consuming process that involves substantial human effort and is therefore not conducted as often as it should be. The aim of our research project was to explore the potential of machine learning methods to reduce human workload. Furthermore, we evaluated the performance of deep learning methods in comparison to more established machine learning methods. We used three available reviews of diagnostic test studies as the data set. To identify relevant publications, we used typical text pre-processing methods. The reference standard for the evaluation was the human consensus, expressed as a binary classification (inclusion, exclusion). For the evaluation of the models, various scenarios were generated using a grid of combinations of data pre-processing steps. Moreover, we evaluated each machine learning approach over an approach-specific, predefined grid of tuning parameters, using the Brier score as the metric. The best performance was obtained with an ensemble method for two of the reviews, and by a deep learning approach for the third. However, the final performance of each approach strongly depends on data preparation. Overall, machine learning methods provided reasonable classification. It seems possible to reduce human workload in updating systematic reviews by using machine learning methods. Yet, as the influence of data pre-processing on the final performance appears to be at least as important as the choice of the specific machine learning approach, users should not blindly expect good performance simply by using approaches from a popular class, such as deep learning.
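The Brier score mentioned above measures the mean squared difference between predicted inclusion probabilities and the binary reference labels; lower is better, with 0 meaning perfect probabilistic predictions. A minimal sketch of how screening models could be compared on this metric follows; the toy labels and the two hypothetical models' predicted probabilities are illustrative assumptions, not the paper's actual data or pipeline.

```python
# Brier score for binary include/exclude screening decisions:
# mean((p_i - y_i)^2), lower is better, 0 = perfect.
# Labels and predictions below are illustrative assumptions only.

def brier_score(labels, probs):
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)

# Human-consensus reference standard: 1 = include, 0 = exclude.
labels = [1, 0, 1, 0, 0, 1]

# Predicted inclusion probabilities from two hypothetical classifiers.
confident = [0.9, 0.1, 0.8, 0.2, 0.1, 0.9]
uncertain = [0.6, 0.4, 0.5, 0.5, 0.4, 0.6]

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(f"{name}: Brier = {brier_score(labels, probs):.3f}")
```

In a tuning-grid setting like the one described in the abstract, each combination of pre-processing steps and hyperparameters would yield one such score, and the configuration with the lowest Brier score would be preferred.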