Google DeepMind, Mountain View, CA, USA.
Google Research, New York, NY, USA.
Nat Med. 2023 Jul;29(7):1814-1820. doi: 10.1038/s41591-023-02437-x. Epub 2023 Jul 17.
Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5-15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in new clinical settings, we present results showing that CoDoC's performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.
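To make the idea of "learning to decide between the opinion of a predictive AI model and a clinical workflow" concrete, here is a minimal sketch of one possible deferral rule. It is not the authors' open-source implementation, and every name in it (`fit_deferral_bins`, `predict`, `n_bins`, `threshold`) is hypothetical. It assumes the AI outputs a confidence score in [0, 1], the clinical workflow outputs a binary opinion, and ground-truth labels are available for a tuning set. The sketch splits the score range into bins and defers to the clinical workflow in any bin where the workflow was more accurate than the thresholded AI score on the tuning data.

```python
from typing import List, Sequence


def fit_deferral_bins(
    scores: Sequence[float],
    clinician: Sequence[int],
    labels: Sequence[int],
    n_bins: int = 5,
    threshold: float = 0.5,
) -> List[bool]:
    """Return one flag per score bin: True means defer to the clinical workflow.

    Illustrative assumption: a bin is deferred when the clinical workflow was
    correct strictly more often than the thresholded AI score in that bin.
    """
    ai_correct = [0] * n_bins
    clin_correct = [0] * n_bins
    for s, c, y in zip(scores, clinician, labels):
        b = min(int(s * n_bins), n_bins - 1)  # bin index for this score
        ai_correct[b] += int((s >= threshold) == bool(y))
        clin_correct[b] += int(c == y)
    return [clin_correct[b] > ai_correct[b] for b in range(n_bins)]


def predict(
    score: float,
    clinician_opinion: int,
    defer: Sequence[bool],
    threshold: float = 0.5,
) -> int:
    """Use the clinical opinion in deferred bins, the thresholded AI score otherwise."""
    n_bins = len(defer)
    b = min(int(score * n_bins), n_bins - 1)
    if defer[b]:
        return clinician_opinion
    return int(score >= threshold)


# Toy tuning set (made-up numbers, for illustration only).
scores = [0.05, 0.10, 0.45, 0.55, 0.90, 0.95]
clinician = [0, 1, 1, 0, 1, 0]
labels = [0, 0, 1, 0, 1, 1]
defer = fit_deferral_bins(scores, clinician, labels)
```

On this toy data the AI is reliable at the extremes of the score range and unreliable near the threshold, so only the middle bin is deferred. In this sketch, a reduction in clinician workload would correspond to the share of cases whose scores fall outside the deferred bins, since those are decided without a clinical read.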