Department of Community Health Sciences, University of Calgary, 3280 Hospital Drive NW, Calgary, AB, T2N 4Z6, Canada.
Department of Family Medicine, University of Calgary, 3330 Hospital Drive NW, Calgary, AB, T2N 4N1, Canada.
Sci Rep. 2023 Jan 2;13(1):13. doi: 10.1038/s41598-022-27264-x.
Risk prediction models are frequently used to identify individuals at risk of developing hypertension. This study evaluates different machine learning algorithms and compares their predictive performance with that of the conventional Cox proportional hazards (PH) model in predicting hypertension incidence from survival data. We analyzed data on 24 candidate features for 18,322 participants from the large Alberta's Tomorrow Project (ATP) to develop the prediction models. To select the top features, we applied five feature selection methods: two filter-based methods (univariate Cox p-value and C-index), two embedded methods (random survival forest and least absolute shrinkage and selection operator (Lasso)), and one constraint-based method (statistically equivalent signature (SES)). Five machine learning algorithms were developed to predict hypertension incidence: three penalized regression methods (Ridge, Lasso, and Elastic Net (EN)), random survival forest (RSF), and gradient boosting (GB), alongside the conventional Cox PH model. The predictive performance of the models was assessed using the C-index. The machine learning algorithms performed similarly to the conventional Cox PH model: average C-indexes were 0.78, 0.78, 0.78, 0.76, 0.76, and 0.77 for Ridge, Lasso, EN, RSF, GB, and Cox PH, respectively. The important features associated with each model are also presented. Our findings demonstrate little difference in predictive performance between machine learning algorithms and the conventional Cox PH regression model in predicting hypertension incidence. In a moderately sized dataset with a reasonable number of features, conventional regression-based models perform similarly to machine learning algorithms, with good predictive accuracy.
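Because every model above is compared on the C-index, a minimal sketch of how that metric can be computed for right-censored data may help. This is an illustration of Harrell's concordance index in plain Python, not the authors' implementation; the function name `harrell_c_index` and the toy data are hypothetical, and pairs with tied follow-up times are simply skipped, which is a simplifying assumption.

```python
from typing import Sequence


def harrell_c_index(
    times: Sequence[float],
    events: Sequence[int],
    risk_scores: Sequence[float],
) -> float:
    """Harrell's C-index for right-censored survival data (illustrative sketch).

    times:       follow-up time for each participant
    events:      1 if the event (e.g. incident hypertension) was observed, 0 if censored
    risk_scores: model output where a higher score means higher predicted risk
    """
    n = len(times)
    if not (n == len(events) == len(risk_scores)):
        raise ValueError("times, events and risk_scores must have equal length")

    concordant = 0.0
    comparable = 0
    for i in range(n):
        # A pair is comparable only if the earlier time is an observed event.
        if not events[i]:
            continue
        for j in range(n):
            # Assumption: pairs with tied times are skipped (strict inequality).
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event had the higher risk score
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half-concordant

    if comparable == 0:
        raise ValueError("no comparable pairs; C-index is undefined")
    return concordant / comparable


# Hypothetical toy data: four participants, the third one censored.
toy_times = [2.0, 4.0, 6.0, 8.0]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.3, 0.5, 0.1]
toy_c = harrell_c_index(toy_times, toy_events, toy_risk)
```

On this toy data, five pairs are comparable and four of them are ordered correctly, so `toy_c` is 0.8. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the reported values of 0.76 to 0.78 should be read.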