Zhu Xing-Yu, Li Wei, Yuan Guo-Liang, Pan Xu-Yang
Department of Cardiovascular Medicine, Shuyang Hospital of Traditional Chinese Medicine, Shuyang, Jiangsu Province, 223600, China.
BMC Med Inform Decis Mak. 2026 Feb 26. doi: 10.1186/s12911-026-03389-1.
Cardiovascular disease is the most formidable public health challenge in China: it accounts for 48.98% and 47.35% of mortality in rural and urban populations, respectively, and affects approximately 330 million individuals. Existing risk stratification models derive predominantly from Western populations, and the Framingham Risk Equation systematically overestimates cardiovascular risk by 276% in Chinese men and 102% in Chinese women, underscoring the need for population-specific predictive instruments. Although machine learning methods show considerable promise in cardiovascular risk prediction, their inherent "black-box" character substantially impedes clinical translation.
Leveraging longitudinal cohort data from the China Health and Retirement Longitudinal Study (CHARLS) and integrating machine learning with explainable artificial intelligence techniques, we sought to develop and validate a cardiovascular disease long-term risk prediction model tailored to the Chinese middle-aged and elderly population, achieving optimal synthesis of predictive accuracy and clinical interpretability through quantitative risk factor contribution analysis.
We incorporated four waves of CHARLS surveillance data spanning 2011-2020, with 8,080 participants aged ≥ 45 years completing 9-year follow-up after rigorous inclusion criteria application. Recursive feature elimination was employed to identify optimal predictors from 90 candidate variables. We systematically evaluated 12 machine learning algorithms encompassing linear, non-linear, ensemble learning, and deep learning methodologies, utilizing stratified random 7:3 partitioning for training and validation cohorts. SHAP (SHapley Additive exPlanations) methodology facilitated comprehensive global and local interpretability analyses, with decision curve analysis assessing clinical net benefit.
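The stratified 7:3 partitioning described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_split`, the seed, and the synthetic labels are assumptions made for illustration, and only the Python standard library is used. The point is that splitting within each outcome class keeps the event rate the same in the training and validation cohorts.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, valid_idx), splitting each class at train_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random assignment within the class
        cut = round(len(members) * train_fraction)
        train_idx.extend(members[:cut])
        valid_idx.extend(members[cut:])
    return sorted(train_idx), sorted(valid_idx)


# Synthetic outcome labels (not CHARLS data): 1,000 people, 22% with an event.
labels = [1] * 220 + [0] * 780
train_idx, valid_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
valid_rate = sum(labels[i] for i in valid_idx) / len(valid_idx)
print(len(train_idx), len(valid_idx), train_rate, valid_rate)
```

In practice a library routine with a stratification option would normally be used; the hand-written version only makes the mechanism explicit.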
Among 5,699 training cohort participants, 1,248 (21.9%) experienced cardiovascular events during follow-up. Recursive feature elimination identified 18 pivotal predictive factors spanning lipid metabolism, anthropometric parameters, renal function, and glucose homeostasis domains. The gradient boosting machine demonstrated superior comprehensive performance, achieving validation cohort AUC of 0.798 (95% CI: 0.776-0.820), specificity of 98%, and positive predictive value of 78%. SHAP analysis revealed waist circumference, triglycerides, and hypertension history as the three predominant predictive factors, with mean absolute SHAP values significantly exceeding those of other variables. Individual risk attribution analysis demonstrated substantial heterogeneity: extremely high-risk individuals (predicted probability 0.991) exhibited synergistic multi-factorial risk amplification, with standardized waist circumference contributing a SHAP value of +0.0778 and triglycerides (477 mg/dL) contributing +0.0729; conversely, in low-risk individuals (predicted probability 0.0393), triglycerides (45.1 mg/dL) provided the largest single protective contribution, -0.166. Decision curve analysis confirmed positive net benefit across the 0-0.95 threshold probability spectrum, systematically surpassing conventional strategies.
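The per-individual attributions above rest on the additive property of Shapley values: a baseline prediction plus the sum of feature contributions reproduces the model output for that person. The sketch below computes exact Shapley values by brute force for a small hypothetical model. Everything in it is an assumption made for illustration: `toy_risk`, its coefficients, the three feature names, and the single all-zero baseline. It is not the fitted gradient boosting machine, and SHAP libraries use faster, model-specific algorithms rather than enumeration.

```python
from itertools import permutations
from math import exp, factorial

FEATURES = ("waist_z", "triglycerides_z", "hypertension")


def toy_risk(x):
    """Hypothetical risk model with one interaction term (illustration only)."""
    score = (-1.5 + 0.8 * x["waist_z"] + 0.6 * x["triglycerides_z"]
             + 0.9 * x["hypertension"]
             + 0.4 * x["waist_z"] * x["hypertension"])
    return 1.0 / (1.0 + exp(-score))


def shapley_values(model, x, baseline):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    phi = {name: 0.0 for name in FEATURES}
    for order in permutations(FEATURES):
        current = dict(baseline)      # start from the reference individual
        previous = model(current)
        for name in order:
            current[name] = x[name]   # switch one feature to its actual value
            value = model(current)
            phi[name] += value - previous
            previous = value
    n_orders = factorial(len(FEATURES))
    return {name: total / n_orders for name, total in phi.items()}


baseline = {"waist_z": 0.0, "triglycerides_z": 0.0, "hypertension": 0}
person = {"waist_z": 2.0, "triglycerides_z": 1.5, "hypertension": 1}

phi = shapley_values(toy_risk, person, baseline)
reconstructed = toy_risk(baseline) + sum(phi.values())
print(phi, reconstructed, toy_risk(person))
```

Because the contributions sum exactly to the gap between the individual's prediction and the baseline, a positive value raises predicted risk and a negative value lowers it, which is how the signed contributions reported above are read.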
The gradient boosting machine model achieved superior discrimination (AUC 0.798, 95% CI 0.785-0.825) compared to Framingham (0.638) and China-PAR (0.654) scores for 9-year cardiovascular disease prediction in Chinese adults aged ≥ 45 years. Waist circumference, triglycerides, and hypertension emerged as principal predictive features, though SHAP-derived importance reflects statistical contribution rather than causal effects. Decision curve analysis demonstrated clinical utility across threshold probabilities 0.05-0.95, enabling flexible deployment from population screening (98.3% sensitivity) to targeted intervention (98.7% specificity). External validation in independent cohorts is essential to establish generalizability before clinical implementation.
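For readers unfamiliar with decision curve analysis, the net benefit at a threshold probability p_t is conventionally TP/n - (FP/n) * p_t / (1 - p_t), compared against treating everyone and treating no one (net benefit 0). The sketch below applies that standard formula to ten made-up predictions; the function names and data are hypothetical and unrelated to the study's results.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk >= threshold."""
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the 'treat everyone' reference strategy."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Made-up outcomes and predicted risks for ten people (illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.6, 0.05, 0.4, 0.15]

for threshold in (0.2, 0.5):
    model_nb = net_benefit(y_true, y_prob, threshold)
    all_nb = net_benefit_treat_all(y_true, threshold)
    print(f"threshold={threshold}: model={model_nb:.3f}, treat-all={all_nb:.3f}")
```

A model is considered useful at a given threshold when its net benefit exceeds both reference strategies; plotting net benefit over a range of thresholds yields the decision curve.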
Not applicable.