Division of Cardiovascular Medicine and Cardiovascular Institute, Stanford University School of Medicine, Stanford, California, USA.
Department of Electrical Engineering, Stanford University, Stanford, California, USA.
Open Heart. 2021 Oct;8(2). doi: 10.1136/openhrt-2021-001802.
Identifying high-risk patients is crucial for effective cardiovascular disease (CVD) prevention. It is not known whether electronic health record (EHR)-based machine-learning (ML) models can improve CVD risk stratification compared with a secondary prevention risk score developed from randomised clinical trials (Thrombolysis in Myocardial Infarction Risk Score for Secondary Prevention, TRS 2°P).
We identified patients with CVD, including atherosclerotic CVD (ASCVD), in a large health system and split them into 80% training and 20% test sets. A rich set of patient features was extracted from the EHR. ML models were trained to estimate 5-year CVD event risk: random forests (RF), gradient-boosted machines (GBM), extreme gradient-boosted models (XGBoost), and logistic regression with an L2 penalty and with an L1 penalty (Lasso). ML models and TRS 2°P were evaluated by the area under the receiver operating characteristic curve (AUC).
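The split and the evaluation metric described here can be illustrated with a minimal sketch using only the Python standard library. The function names `train_test_split` and `auc`, the random seed, and the toy inputs are hypothetical and are not taken from the study; the AUC is computed through its pairwise interpretation (the probability that a randomly chosen patient with an event receives a higher predicted risk than a randomly chosen patient without one), which is equivalent to the area under the ROC curve.

```python
import random


def train_test_split(ids, test_fraction=0.2, seed=0):
    """Shuffle identifiers and split them into (train, test) lists.

    The 0.2 default mirrors the 80%/20% split in the text; the seed is an
    arbitrary illustrative choice.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def auc(labels, scores):
    """AUC as the fraction of (event, non-event) pairs ranked correctly.

    labels: 1 for an event, 0 for no event. Tied scores count as half a win.
    This pairwise form is O(events * non-events); it is meant for clarity,
    not for large cohorts.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to ranking patients no better than chance, and 1.0 to ranking every patient with an event above every patient without one.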
The cohort included 32 192 patients (median age 74 years; 46% female, 63% non-Hispanic white and 12% Asian), of whom 23 475 had ASCVD. There were 4010 events over 5 years of follow-up. ML models demonstrated good overall performance; XGBoost demonstrated AUC 0.70 (95% CI 0.68 to 0.71) in the full CVD cohort and AUC 0.71 (95% CI 0.69 to 0.73) in patients with ASCVD, with comparable performance by GBM, RF and Lasso. TRS 2°P performed poorly in all CVD (AUC 0.51, 95% CI 0.50 to 0.53) and ASCVD (AUC 0.50, 95% CI 0.48 to 0.52) patients. ML identified nontraditional predictive variables including education level and primary care visits.
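The abstract reports each AUC with a 95% CI but does not state how the intervals were obtained. As one common approach, and purely as an assumption, the sketch below computes a percentile bootstrap interval by resampling test-set patients with replacement. The names `auc` and `bootstrap_auc_ci`, the number of resamples, and the seed are hypothetical.

```python
import random


def auc(labels, scores):
    """Pairwise AUC: fraction of (event, non-event) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method, not the study's).

    Resamples patients with replacement; resamples that contain only one
    class are skipped because the AUC is undefined for them.
    """
    if len(set(labels)) < 2:
        raise ValueError("need both events and non-events")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Under this reading, an interval such as 0.50 to 0.53 that sits at or just above 0.5 indicates discrimination close to chance, whereas an interval such as 0.68 to 0.71 lies clearly above it.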
In a multiethnic real-world population, EHR-based ML approaches significantly improved CVD risk stratification for secondary prevention.