Nuffield Department of Surgical Sciences, University of Oxford, Oxford, United Kingdom.
Department of Radiology, University of Cambridge, Cambridge, United Kingdom.
JAMA Netw Open. 2021 Mar 1;4(3):e211276. doi: 10.1001/jamanetworkopen.2021.1276.
Importance: An increasing number of machine learning (ML)-based clinical decision support systems (CDSSs) are described in the medical literature, but this research focuses almost entirely on comparing CDSSs directly with clinicians (human vs computer). Little is known about the outcomes of these systems when used as adjuncts to human decision-making (human vs human with computer).
Objective: To conduct a systematic review to investigate the association between the interactive use of ML-based diagnostic CDSSs and clinician performance and to examine the extent of the CDSSs' human factors evaluation.
Evidence Review: A search of MEDLINE, Embase, PsycINFO, and grey literature was conducted for the period between January 1, 2010, and May 31, 2019. Peer-reviewed studies published in English comparing human clinician performance with and without interactive use of an ML-based diagnostic CDSS were included. All metrics used to assess human performance were considered as outcomes. The risk of bias was assessed using Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) and Risk of Bias in Non-randomised Studies of Interventions (ROBINS-I). Narrative summaries were produced for the main outcomes. Given the heterogeneity of medical conditions, outcomes of interest, and evaluation metrics, no meta-analysis was performed.
Findings: A total of 8112 studies were initially retrieved and 5154 abstracts were screened; of these, 37 studies met the inclusion criteria. The median number of participating clinicians was 4 (interquartile range, 3-8). Of the 107 results that reported statistical significance, 54 (50%) were increased by the use of CDSSs, 4 (4%) were decreased, and 49 (46%) showed no change or an unclear change. In the subgroup of studies carried out in representative clinical settings, no association between the use of ML-based diagnostic CDSSs and improved clinician performance could be observed. Interobserver agreement was the commonly reported outcome whose change was the most strongly associated with CDSS use. Four studies (11%) reported on user feedback, and, in all but 1 case, clinicians decided to override at least some of the algorithms' recommendations. Twenty-eight studies (76%) were rated as having a high risk of bias in at least 1 of the 4 QUADAS-2 core domains, and 6 studies (16%) were considered to be at serious or critical risk of bias using ROBINS-I.
Conclusions and Relevance: This systematic review found only sparse evidence that the use of ML-based CDSSs is associated with improved clinician diagnostic performance. Most studies had a low number of participants, were at high or unclear risk of bias, and showed little or no consideration for human factors. Caution should be exercised when estimating the current potential of ML to improve human diagnostic performance, and more comprehensive evaluation should be conducted before deploying ML-based CDSSs in clinical settings. The results highlight the importance of considering supported human decisions as end points rather than merely the stand-alone CDSS outputs.
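The percentages reported in the abstract can be rechecked from the stated counts. The following is a minimal sketch, not part of the published analysis: the helper name `pct` and the dictionary labels are illustrative assumptions, and only the counts (107 results, 37 studies, and their subsets) come from the text above.

```python
# Recompute the rounded percentages quoted in the abstract from its raw counts.
# `pct` and the dictionary keys are hypothetical names used for illustration only.

def pct(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)

# Results that reported statistical significance (denominator: 107 results).
results_total = 107
results = {
    "increased": 54,
    "decreased": 4,
    "no or unclear change": 49,
}

# Study-level counts (denominator: 37 included studies).
studies_total = 37
studies = {
    "reported user feedback": 4,
    "high risk of bias (QUADAS-2)": 28,
    "serious or critical risk of bias (ROBINS-I)": 6,
}

# The three result categories should account for every reported result.
assert sum(results.values()) == results_total

for label, n in results.items():
    print(f"{label}: {n}/{results_total} = {pct(n, results_total)}%")

for label, n in studies.items():
    print(f"{label}: {n}/{studies_total} = {pct(n, studies_total)}%")
```

The three result categories sum to 107, so the reported 50%, 4%, and 46% are shares of a single denominator; the study-level figures (11%, 76%, 16%) use the 37 included studies as their denominator and need not sum to 100%.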