Li Yan, Sperrin Matthew, Ashcroft Darren M, van Staa Tjeerd Pieter
Health e-Research Centre, Health Data Research UK North, School of Health Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester M13 9PL, UK.
Centre for Pharmacoepidemiology and Drug Safety, School of Health Sciences, Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK.
BMJ. 2020 Nov 4;371:m3919. doi: 10.1136/bmj.m3919.
To assess the consistency of machine learning and statistical techniques in predicting individual level and population level risks of cardiovascular disease and the effects of censoring on risk predictions.
Longitudinal cohort study from 1 January 1998 to 31 December 2018.
3.6 million patients from the Clinical Practice Research Datalink registered at 391 general practices in England with linked hospital admission and mortality records.
Model performance, including discrimination, calibration, and consistency of individual risk prediction for the same patients among models with comparable model performance. 19 different prediction techniques were applied, including 12 families of machine learning models (grid searched for best models), three Cox proportional hazards models (locally fitted, QRISK3, and Framingham), three parametric survival models, and one logistic model.
The various models had similar population level performance (C statistics of about 0.87 and similar calibration). However, the predictions for individual risks of cardiovascular disease varied widely between and within different types of machine learning and statistical models, especially in patients with higher risks. A patient with a risk of 9.5-10.5% predicted by QRISK3 had a risk of 2.9-9.2% in a random forest and 2.4-7.2% in a neural network. The differences in predicted risks between QRISK3 and a neural network ranged between -23.2% and 0.1% (95% range). Models that ignored censoring (that is, assumed censored patients to be event free) substantially underestimated risk of cardiovascular disease. Of the 223 815 patients with a cardiovascular disease risk above 7.5% with QRISK3, 57.8% would be reclassified below 7.5% when using another model.
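The underestimation caused by ignoring censoring can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the study's analysis: the exponential event times, the uniform loss to follow-up, the hazard of 0.01 per year, and the function names `simulate`, `naive_risk`, and `kaplan_meier_risk` are all assumptions made for illustration. It compares a naive 10 year risk that counts censored patients as event free with a Kaplan-Meier estimate that accounts for censoring.

```python
import math
import random


def simulate(n=20000, hazard=0.01, horizon=10.0, seed=0):
    """Return (follow_up_years, had_event) pairs for a synthetic cohort."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        event_time = rng.expovariate(hazard)           # time to first event
        censor_time = rng.uniform(0.0, 2.0 * horizon)  # assumed loss to follow-up
        end = min(censor_time, horizon)                # follow-up stops here at the latest
        cohort.append((min(event_time, end), event_time <= end))
    return cohort


def naive_risk(cohort):
    """Risk when censored patients are assumed to be event free."""
    return sum(1 for _, had_event in cohort if had_event) / len(cohort)


def kaplan_meier_risk(cohort, horizon):
    """1 - S(horizon), with S estimated by the Kaplan-Meier product."""
    # Process events before censorings that share the same time.
    ordered = sorted(cohort, key=lambda row: (row[0], not row[1]))
    at_risk = len(ordered)
    survival = 1.0
    for time, had_event in ordered:
        if time > horizon:
            break
        if had_event:
            survival *= 1.0 - 1.0 / at_risk
        at_risk -= 1  # the patient leaves the risk set either way
    return 1.0 - survival


cohort = simulate()
true_risk = 1.0 - math.exp(-0.01 * 10.0)  # known only because the data are simulated
naive = naive_risk(cohort)
km = kaplan_meier_risk(cohort, horizon=10.0)
print(f"true {true_risk:.3f}  naive {naive:.3f}  Kaplan-Meier {km:.3f}")
```

Under these assumptions the naive estimate falls clearly below the simulated true risk, while the Kaplan-Meier estimate stays close to it. The size of the gap depends on how much censoring occurs before the prediction horizon.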
A variety of models predicted risks for the same patients very differently despite similar model performances. The logistic models and commonly used machine learning models should not be directly applied to the prediction of long term risks without considering censoring. Survival models that consider censoring and that are explainable, such as QRISK3, are preferable. The level of consistency within and between models should be routinely assessed before they are used for clinical decision making.
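One simple way to check consistency between two models is to count how many patients placed above a treatment threshold by a reference model fall below it under a comparison model. The sketch below assumes paired predictions for the same patients are already available as lists of probabilities. The function name `reclassified_below` and the five example patients are hypothetical, and this is not presented as the authors' own procedure.

```python
def reclassified_below(reference_risks, comparison_risks, threshold=0.075):
    """Return (n_flagged, n_moved, proportion_moved).

    n_flagged: patients whose reference risk is above the threshold.
    n_moved:   those among them whose comparison risk is below the threshold.
    """
    if len(reference_risks) != len(comparison_risks):
        raise ValueError("both models must score the same patients")
    flagged = [
        (ref, comp)
        for ref, comp in zip(reference_risks, comparison_risks)
        if ref > threshold
    ]
    if not flagged:
        return 0, 0, float("nan")
    moved = sum(1 for _, comp in flagged if comp < threshold)
    return len(flagged), moved, moved / len(flagged)


# Hypothetical predictions for five patients (not study data).
reference = [0.10, 0.08, 0.12, 0.05, 0.09]
comparison = [0.06, 0.09, 0.04, 0.03, 0.07]
result = reclassified_below(reference, comparison)
print(result)  # (4, 3, 0.75)
```

In this toy example, four patients exceed 7.5% under the reference model and three of them drop below 7.5% under the comparison model. The same proportion computed on real paired predictions corresponds to the kind of reclassification figure reported in the results.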