Li Yikuan, Salimi-Khorshidi Gholamreza, Rao Shishir, Canoy Dexter, Hassaine Abdelaali, Lukasiewicz Thomas, Rahimi Kazem, Mamouei Mohammad
Deep Medicine, Oxford Martin School, University of Oxford, Hayes House, 75 George Street, Oxford OX1 2BQ, UK.
Nuffield Department of Women's and Reproductive Health, Medical Science Division, University of Oxford, Oxford, UK.
Eur Heart J Digit Health. 2022 Oct 21;3(4):535-547. doi: 10.1093/ehjdh/ztac061. eCollection 2022 Dec.
Deep learning has dominated predictive modelling across different fields, but in medicine it has met with a mixed reception. In clinical practice, simple statistical models and risk scores continue to inform cardiovascular disease risk predictions. This is due in part to a knowledge gap about how deep learning models perform in practice when they are subject to dynamic data shifts, a key criterion that common internal validation procedures do not address. We evaluated the performance of a novel deep learning model, BEHRT, under data shifts and compared it with several ML-based and established risk models.
Using linked electronic health records of 1.1 million patients across England aged at least 35 years between 1985 and 2015, we replicated three established statistical models for predicting 5-year risk of incident heart failure, stroke, and coronary heart disease. The results were compared with a widely accepted machine learning model (random forests) and a novel deep learning model (BEHRT). In addition to internal validation, we investigated how data shifts affect model discrimination and calibration. To this end, we tested the models on cohorts from (i) distinct geographical regions and (ii) different periods. In internal validation, the deep learning models substantially outperformed the best statistical models by 6%, 8%, and 11% in heart failure, stroke, and coronary heart disease, respectively, in terms of the area under the receiver operating characteristic curve.
The performance of all models declined as a result of data shifts; despite this, the deep learning models maintained the best performance in all risk prediction tasks. Updating the model with the latest information can improve discrimination, but if the prior distribution changes, the model may remain miscalibrated.
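The distinction between discrimination and calibration can be illustrated with a minimal sketch. It is not the study's code or data: the toy labels, the predicted risks, and the inflation factor of 1.4 are all assumptions chosen for illustration. It shows that a model whose predicted risks are systematically too high (as may happen when the baseline event rate in the target cohort differs from that in the training cohort) ranks patients exactly as before, so its area under the receiver operating characteristic curve is unchanged, while its average predicted risk no longer matches the observed event rate.

```python
def auroc(scores, labels):
    """Probability that a random case is scored above a random non-case.

    Ties between a case and a non-case count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_in_the_large(scores, labels):
    """Mean predicted risk minus observed event rate (0 = calibrated on average)."""
    return sum(scores) / len(scores) - sum(labels) / len(labels)


# Toy cohort (assumed values): 10 patients, 3 events within the horizon.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
risk = [0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.6, 0.7]

# Same ranking, but every predicted risk inflated by an assumed factor of 1.4.
inflated = [1.4 * r for r in risk]

print("AUROC, original risks:      ", auroc(risk, labels))
print("AUROC, inflated risks:      ", auroc(inflated, labels))
print("Calibration gap, original:  ", calibration_in_the_large(risk, labels))
print("Calibration gap, inflated:  ", calibration_in_the_large(inflated, labels))
```

Because the area under the curve depends only on how patients are ranked, it cannot detect this kind of systematic over- or under-prediction, which is why discrimination and calibration need to be assessed separately under data shift.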