Zheng Qizhi, Zhao Ayang, Wang Xinzhu, Bai Yanhong, Wang Zikun, Wang Xiuying, Zeng Xianzhang, Dong Guanghui
College of Computer and Control Engineering, Northeast Forestry University, No.26, Hexing Road, Xiangfang District, Harbin, 150040, China.
School of Medicine and Health, Key Laboratory of Micro-systems and Micro-structures Manufacturing (Ministry of Education), Harbin Institute of Technology, Harbin, 150001, China.
BMC Neurol. 2025 May 31;25(1):236. doi: 10.1186/s12883-025-04261-x.
Identifying and managing high-risk populations for stroke in a targeted manner is a key area of preventive healthcare.
To assess machine learning (ML) models combined with causal inference from time series analysis, with the aim of building a clinically meaningful model for predicting stroke.
This retrospective cohort study used data from the China Health and Retirement Longitudinal Study (CHARLS), which assessed 11,789 adults in China from 2011 to 2018. Data analysis was performed from June 1 to December 1, 2024.
CHARLS adopts a multi-stage probability sampling method, covering samples from 28 provinces, and collects data every two years through computer-aided personal interviews (CAPI).
This study combined a Vector Autoregression (VAR) model with Graph Neural Networks (GNN) to systematically construct dynamic causal inference features. Multiple classic classification algorithms were compared, including Random Forest, Logistic Regression, XGBoost, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gradient Boosting, and Multi-Layer Perceptron (MLP). The Synthetic Minority Oversampling Technique (SMOTE) was used to oversample the minority class, and models were evaluated with stratified K-fold cross-validation.
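As an illustration of the resampling and validation steps named above, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a binary outcome with the minority class labelled 1. The function names `stratified_k_fold` and `smote_like` are hypothetical, and `smote_like` is a simplified interpolation in the spirit of SMOTE (synthetic minority samples are placed on the segment between a minority point and one of its nearest minority neighbours). Resampling only the training portion of each fold is a common precaution against leakage and is an assumption here; the abstract does not say how the two steps were ordered.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def smote_like(X, y, minority=1, n_neighbors=3, seed=0):
    """Balance a binary dataset by interpolating new minority-class points."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_majority = len(y) - len(minority_pts)
    n_needed = n_majority - len(minority_pts)
    if len(minority_pts) < 2 or n_needed <= 0:
        return list(X), list(y)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority_pts)
        # Rank the other minority points by squared Euclidean distance.
        others = [p for p in minority_pts if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:n_neighbors])
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return list(X) + synthetic, list(y) + [minority] * len(synthetic)


def resampled_folds(X, y, k=5, seed=0):
    """Yield (X_train, y_train, test_idx) with oversampling applied to training data only."""
    for train, test in stratified_k_fold(y, k, seed):
        X_tr, y_tr = smote_like([X[i] for i in train], [y[i] for i in train], seed=seed)
        yield X_tr, y_tr, test
```

Each classifier would then be fitted on `X_tr, y_tr` and scored on the untouched test indices, so that synthetic samples never appear in the data used for evaluation.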
MAIN OUTCOME(S) AND MEASURE(S): AUC (Area Under the Curve), Accuracy, Precision, Recall, F1 Score, and Matthews Correlation Coefficient (MCC).
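For reference, the listed measures can be computed from binary labels and predicted scores as in the sketch below. This is an illustrative plain-Python version under the assumption that the positive class (stroke) is coded 1. The function names are hypothetical, and the convention of returning 0.0 when a denominator is zero is a choice made here, not something stated in the study.

```python
import math


def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, F1 and MCC for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Ties between a positive and a negative count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form of `roc_auc` is quadratic in the sample size, which is fine for a sketch; a rank-based computation would be preferable for a cohort of this size.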
This study included 11,789 participants, of whom 6,334 were female (53.73%) and 5,455 were male (46.27%), with an average age of 65 years. Introducing dynamic causal inference features significantly improved the performance of almost all models. The area under the ROC curve of the models ranged from 0.78 to 0.83, and the differences were statistically significant (P < 0.01). Among all the models, Gradient Boosting demonstrated the highest performance and stability. Model explanation and feature importance analysis identified the factors that contributed most to stroke risk.
This study proposes a stroke risk prediction method that combines dynamic causal inference with machine learning models, significantly improving prediction accuracy and revealing key health factors that affect stroke. The results indicate that dynamic causal inference features have important value in predicting stroke risk, especially in capturing how changes in health status over time affect that risk. With further model optimization and the introduction of more variables, this work provides a theoretical basis and practical guidance for future stroke prevention and intervention strategies.
IRB00001052-11015.1.2.