School of Computing and Information Systems, The University of Melbourne, Melbourne, Victoria, Australia.
Department of Natural Language Processing, MBZUAI, Abu Dhabi, United Arab Emirates.
Res Synth Methods. 2024 Nov;15(6):988-1000. doi: 10.1002/jrsm.1749. Epub 2024 Aug 23.
Existing systems for automating the assessment of risk of bias (RoB) in medical studies are supervised approaches that require substantial training data to work well. However, recent revisions to the RoB guidelines have left little training data available. In this study, we investigate the effectiveness of generative large language models (LLMs) for assessing RoB. Their application requires little or no training data and, if successful, could serve as a valuable tool to assist human experts during the construction of systematic reviews. Following Cochrane's latest guidelines (RoB2), designed for human reviewers, we prepare instructions that are fed as input to LLMs, which then infer the risk associated with a trial publication. We distinguish between two modelling tasks: predicting RoB2 directly from text, and a decomposition approach in which a RoB2 decision is made after the LLM answers a series of signalling questions. We curate new test data sets and evaluate the performance of four general- and medical-domain LLMs. The results fall short of expectations, with the LLMs seldom surpassing trivial baselines. On the direct RoB2 prediction test set (n = 5993), the LLMs perform on par with the baselines (F1: 0.1-0.2). Similar F1 scores are observed in the decomposition task setup (n = 28,150). Our additional comparative evaluation on RoB1 data likewise reveals results substantially below those of a supervised system. This testifies to the difficulty of solving this task from (complex) instructions alone; using LLMs as an assisting technology for assessing RoB2 thus currently seems beyond reach.
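For readers who want to see the two task setups in code, the following is a minimal Python sketch under stated assumptions: `query_llm` is a hypothetical stand-in for any chat-completion client, and the prompt wording, label set, and answer options are illustrative, not the authors' exact instructions.

```python
# Minimal sketch of the two task setups described in the abstract.
# ASSUMPTIONS: `query_llm` is a hypothetical placeholder for a real LLM API;
# prompts, labels, and answer options are illustrative only.

RISK_LABELS = ["low risk", "some concerns", "high risk"]  # RoB2 overall judgements

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    raise NotImplementedError

def direct_rob2(trial_text: str) -> str:
    """Task 1: predict the overall RoB2 judgement directly from the trial text."""
    prompt = (
        "Following the Cochrane RoB2 guidelines, assess the overall risk of bias "
        f"of the trial below. Answer with one of: {', '.join(RISK_LABELS)}.\n\n"
        f"Trial report:\n{trial_text}"
    )
    return query_llm(prompt)

def decomposed_rob2(trial_text: str, signalling_questions: list[str]) -> str:
    """Task 2: answer each RoB2 signalling question first, then derive a judgement."""
    answers = []
    for question in signalling_questions:
        prompt = (
            "Based on the trial report below, answer the signalling question with "
            "yes / probably yes / probably no / no / no information.\n\n"
            f"Question: {question}\n\nTrial report:\n{trial_text}"
        )
        answers.append(query_llm(prompt))
    # Final judgement conditioned on the collected answers; RoB2's own decision
    # rules could equally be applied deterministically at this step.
    summary = "\n".join(f"- {q}: {a}" for q, a in zip(signalling_questions, answers))
    prompt = (
        "Given these signalling-question answers, give the overall RoB2 judgement "
        f"({', '.join(RISK_LABELS)}).\n\n{summary}"
    )
    return query_llm(prompt)
```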
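To make the baseline comparison concrete, here is a small illustrative sketch of scoring a trivial majority-class baseline with macro-averaged F1. The labels and data are invented, and treating the reported F1 as macro-averaged is an assumption, not a detail stated in the abstract.

```python
# Illustrative only: invented labels; assumes the reported F1 is macro-averaged.
from collections import Counter
from sklearn.metrics import f1_score

# Hypothetical gold RoB2 judgements for a handful of trials.
y_true = ["low", "high", "some concerns", "high", "high", "low"]

# Trivial baseline: always predict the majority class of the gold labels.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

# Macro F1 averages per-class F1, so a one-class predictor is penalised on the
# classes it never outputs; this is the sense in which low scores can still be
# "on par with" a trivial baseline.
print(f1_score(y_true, y_pred, average="macro"))
```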