The State Key Laboratory of Molecular Vaccine and Molecular Diagnostics, School of Public Health, Xiamen University, Xiamen 361102, China.
Key Laboratory of Health Technology Assessment of Fujian Province, School of Public Health, Xiamen University, Xiamen 361102, China.
Int J Environ Res Public Health. 2020 Mar 12;17(6):1828. doi: 10.3390/ijerph17061828.
Timely stroke diagnosis and intervention are necessary given the high prevalence of stroke. Previous studies have mainly focused on stroke prediction with balanced data. Thus, this study aimed to develop machine learning models for predicting stroke with imbalanced data in an elderly population in China. Data were obtained from a prospective cohort that included 1131 participants (56 stroke patients and 1075 non-stroke participants) in 2012 and 2014. Data balancing techniques including random over-sampling (ROS), random under-sampling (RUS), and synthetic minority over-sampling technique (SMOTE) were used to process the imbalanced data in this study. Machine learning methods such as regularized logistic regression (RLR), support vector machine (SVM), and random forest (RF) were used to predict stroke with demographic, lifestyle, and clinical variables. Accuracy, sensitivity, specificity, and areas under the receiver operating characteristic curves (AUCs) were used for performance comparison. The top five variables for stroke prediction were selected for each machine learning method based on the SMOTE-balanced data set. The total prevalence of stroke was high in 2014 (4.95%), with men experiencing much higher prevalence than women (6.76% vs. 3.25%). The three machine learning methods performed poorly in the imbalanced data set with extremely low sensitivity (approximately 0.00) and AUC (approximately 0.50). After using data balancing techniques, the sensitivity and AUC considerably improved with moderate accuracy and specificity, and the maximum values for sensitivity and AUC reached 0.78 (95% CI, 0.73-0.83) for RF and 0.72 (95% CI, 0.71-0.73) for RLR, respectively. Using AUCs for RLR, SVM, and RF in the imbalanced data set as references, a significant improvement was observed in the AUCs of all three machine learning methods (p < 0.05) in the balanced data sets.
Considering RLR in each data set as a reference, only RF in the imbalanced data set and SVM in the ROS-balanced data set were superior to RLR in terms of AUC. Sex, hypertension, and uric acid were common predictors in all three machine learning methods. Blood glucose level was included in both RLR and RF. In addition, drinking was included in RLR, age and high-sensitivity C-reactive protein level in SVM, and low-density lipoprotein cholesterol level in RF. Our study suggests that machine learning methods with data balancing techniques are effective tools for stroke prediction with imbalanced data.
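The three balancing techniques named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the two-feature rows are randomly generated placeholders, and the function names, the neighbour count k = 5, and the random seed are all assumptions made for illustration. Only the class sizes (56 stroke, 1075 non-stroke) are taken from the abstract. The final lines show the arithmetic behind one way the reported pattern can arise on unbalanced data: a model that labels every participant as non-stroke reaches about 95% accuracy with zero sensitivity.

```python
import random

def random_over_sample(minority, majority, rng):
    """ROS: draw minority rows with replacement until both classes are equal in size."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority

def random_under_sample(minority, majority, rng):
    """RUS: keep a random majority subset the same size as the minority class."""
    return minority, rng.sample(majority, len(minority))

def smote(minority, n_new, k, rng):
    """SMOTE sketch: place each synthetic row on the segment between a minority
    row and one of its k nearest minority neighbours."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: dist2(base, row))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + gap * (p - b) for b, p in zip(base, partner)))
    return synthetic

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

rng = random.Random(0)  # arbitrary seed, for reproducibility only
# Placeholder features (NOT study data); class sizes follow the abstract.
minority = [(rng.gauss(1, 1), rng.gauss(1, 1)) for _ in range(56)]
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1075)]

ros_min, ros_maj = random_over_sample(minority, majority, rng)
rus_min, rus_maj = random_under_sample(minority, majority, rng)
smote_min = minority + smote(minority, len(majority) - len(minority), 5, rng)
print(len(ros_min), len(rus_maj), len(smote_min))  # 1075 56 1075

# Degenerate classifier on the unbalanced labels: predict "no stroke" for everyone.
y_true = [1] * 56 + [0] * 1075
y_pred = [0] * 1131
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sens, spec = sensitivity_specificity(y_true, y_pred)
print(round(accuracy, 3), sens, spec)  # 0.95 0.0 1.0
```

In practice, balancing would be applied to the training split only, so that the evaluation data keep the original class ratio; whether the study did this is not stated in the abstract.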