Graduate Institute of Applied Science and Engineering, Fu Jen Catholic University, New Taipei City, Taiwan, ROC.
Department of Urology, Cardinal Tien Hospital, School of Medicine, Fu-Jen Catholic University, New Taipei City, Taiwan, ROC.
Sci Rep. 2024 Oct 5;14(1):23234. doi: 10.1038/s41598-024-73799-6.
The prevalence of osteoporosis has increased drastically in recent years. It is not only common but also a major global public health problem because of its high morbidity. Many risk factors associated with osteoporosis have been identified. However, most studies have used traditional multiple linear regression (MLR) to explore these relationships. Recently, machine learning (Mach-L) has emerged as a new approach to data analysis because it enables machines to learn from past data or experience without being explicitly programmed and can capture nonlinear relationships better. These methods have the potential to outperform conventional MLR in disease prediction. In the present study, we enrolled a cohort of postmenopausal Chinese women who were followed up for 4 years. The change in T-score over the follow-up period (δ-T score) was the dependent variable. Demographic, biochemical, and lifestyle data were the independent variables. Our goals were to (1) compare the prediction accuracy of Mach-L and traditional MLR for δ-T score and (2) rank the importance of the risk factors (independent variables) for predicting δ-T score. In total, 1698 postmenopausal women were enrolled from the MJ Health Database. Four Mach-L methods, namely random forest (RF), eXtreme Gradient Boosting (XGBoost), Naïve Bayes (NB), and stochastic gradient boosting (SGB), were used to construct models for predicting δ-BMD after four years of follow-up. The dataset was randomly divided into an 80% training dataset for model building and a 20% testing dataset for model testing. Ten-fold cross-validation was used for hyperparameter tuning. For each Mach-L method, the model with the lowest root mean square error on the validation dataset was taken as the best model. The averaged metrics of the RF, SGB, NB, and XGBoost models were compared with those of the benchmark MLR model, which used the same training and testing datasets as the Mach-L methods.
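The evaluation protocol described above (80/20 split, 10-fold cross-validation for hyperparameter tuning, model selection by root mean square error, comparison against an MLR benchmark) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the data are synthetic, the hyperparameter grids are arbitrary assumptions, scikit-learn is assumed as the library, and only RF and SGB are shown (XGBoost needs a separate package, and NB is omitted here).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): 300 subjects, 22 predictors,
# and a nonlinear target playing the role of the delta-T score.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 22))
y = np.sin(2 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)


def rmse(y_true, y_pred):
    """Root mean square error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


# Candidate models with small illustrative grids (assumptions, not study settings).
# subsample < 1 is what makes gradient boosting "stochastic".
candidates = {
    "RF": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [50], "max_features": [0.3, 1.0]},
    ),
    "SGB": (
        GradientBoostingRegressor(subsample=0.8, random_state=0),
        {"n_estimators": [50, 100], "max_depth": [2, 3]},
    ),
}

results = {}
for name, (model, grid) in candidates.items():
    # 10-fold CV on the training set; the setting with the lowest
    # validation RMSE is refit on the full training set.
    search = GridSearchCV(model, grid, cv=10, scoring="neg_root_mean_squared_error")
    search.fit(X_train, y_train)
    results[name] = rmse(y_test, search.predict(X_test))

# Benchmark MLR fitted and tested on the same split.
mlr = LinearRegression().fit(X_train, y_train)
results["MLR"] = rmse(y_test, mlr.predict(X_test))

for name, value in results.items():
    print(f"{name}: test RMSE = {value:.3f}")
```

Fitting the benchmark on exactly the same split is what makes the test-set errors comparable across methods.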
Within each model, risk factors were ranked by priority, with 1 denoting the most critical risk factor and 22 the last selected one. In Pearson correlation analysis, age, education, BMI, HDL-C, and TSH were positively correlated with δ-T score, whereas plasma calcium level and baseline T-score were negatively correlated with it. All four Mach-L methods yielded lower prediction errors than MLR and were therefore considered convincing models. Our results indicate that education level is the most important factor for δ-T score, followed by DBP, smoking, SBP, UA, age, and LDL-C. Overall, all four Mach-L methods outperformed traditional MLR. Using Mach-L, the six most important risk factors selected were, in descending order of importance, DBP, SBP, UA, education level, TG, and sleeping hours. In postmenopausal Chinese women, δ-T score was positively related to SBP, education level, UA, and TG and negatively related to DBP and sleeping hours.
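One way to turn per-model priorities into a single ordering is to rank the factors within each model (1 = most important) and then average the ranks across models. The sketch below assumes that averaging scheme, which the abstract does not spell out, and uses made-up importance scores with placeholder feature names rather than study data.

```python
from statistics import mean


def rank_features(scores):
    """Map each feature to its rank within one model (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {feature: position for position, feature in enumerate(ordered, start=1)}


# Hypothetical importance scores per model (higher = more important).
importances = {
    "RF": {"feat_a": 0.40, "feat_b": 0.30, "feat_c": 0.20, "feat_d": 0.10},
    "SGB": {"feat_a": 0.35, "feat_b": 0.15, "feat_c": 0.30, "feat_d": 0.20},
    "XGBoost": {"feat_a": 0.25, "feat_b": 0.45, "feat_c": 0.20, "feat_d": 0.10},
}

# Rank within each model, then average the ranks across models (assumed scheme).
per_model_ranks = {name: rank_features(s) for name, s in importances.items()}
features = list(importances["RF"])
mean_rank = {f: mean(r[f] for r in per_model_ranks.values()) for f in features}

# A lower mean rank means the factor is more important overall.
final_order = sorted(mean_rank, key=mean_rank.get)
print(final_order)
```

Averaging ranks rather than raw importance scores avoids comparing scores that are on different scales in different model families.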