Division of Clinical Geriatrics, Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, Stockholm, Sweden.
Med Image Anal. 2020 Dec;66:101714. doi: 10.1016/j.media.2020.101714. Epub 2020 May 1.
Deep learning (DL) methods have in recent years yielded impressive results in medical imaging, with the potential to function as a clinical aid to radiologists. However, DL models in medical imaging are often trained on public research cohorts whose images were acquired with a single scanner or under strict protocol harmonization, which is not representative of a clinical setting. The aim of this study was to investigate how well a DL model performs on unseen clinical datasets (collected with different scanners, protocols and disease populations) and whether more heterogeneous training data improves generalization. In total, 3117 brain MRI scans from multiple dementia research cohorts and memory clinics, all visually rated by a neuroradiologist according to Scheltens' scale of medial temporal atrophy (MTA), were included in this study. By training multiple versions of a convolutional neural network on different subsets of these data to predict MTA ratings, we assessed the impact that including images from a wider distribution during training had on performance in external memory clinic data. Our results showed that the model generalized well to datasets acquired with protocols similar to those of the training data, but performed substantially worse in clinical cohorts whose images had visibly different tissue contrasts. This implies that future DL studies investigating performance in out-of-distribution (OOD) MRI data need to assess multiple external cohorts to obtain reliable results. Further, including data from a wider range of scanners and protocols improved performance in OOD data, which suggests that more heterogeneous training data makes the model generalize better. To conclude, this is the most comprehensive study to date investigating domain shift in deep learning on MRI data, and we advocate rigorous evaluation of DL models on clinical data before they are certified for deployment.