Zhang Danni, Yang Xingyu, Wang Fangying, Qiu Cifang, Chai Yanfu, Fang Danruo
Department of Functional, Shaoxing Hospital of Traditional Chinese Medicine, Shaoxing, 312000, Zhejiang, China.
School of Mechanical and Electrical Engineering, Shaoxing University, Shaoxing, 312000, China.
J Med Syst. 2025 May 29;49(1):72. doi: 10.1007/s10916-025-02203-1.
This study systematically examined how three feature selection techniques (Boruta, extreme gradient boosting (XGBoost), and Lasso) affect the performance of four machine learning models (random forest (RF), XGBoost, logistic regression (LR), and support vector machine (SVM)) in predicting bone density prevalence. Varying the data partitioning ratio (training:test of 0.6:0.4, 0.7:0.3, 0.8:0.2, and 0.9:0.1) had minimal impact on prediction accuracy across all four models, a conclusion reinforced by 10-fold cross-validation. In addition, principal component analysis (PCA) led to substantial accuracy degradation (0.6-0.8 range), suggesting that it is incompatible with this study's requirements because of the inherently complex decision boundaries in the original high-dimensional data. Comparative analysis showed that the Boruta-XGBoost combination achieved the best performance (accuracy: 0.9083 ± 0.0146), significantly outperforming the Lasso-LR combination (0.7480 ± 0.0157) across all evaluation frameworks. In terms of discriminative capacity, the RF model achieved area under the receiver operating characteristic curve (AUROC) values of 0.85, 0.81, and 0.80 under the different feature selection approaches, surpassing the SVM model (0.78, 0.76, and 0.76). This advantage likely stems from RF's native capability to capture non-linear relationships and feature interactions.
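The evaluation protocol described in the abstract (hold-out splits at several training fractions, checked against 10-fold cross-validation and reported as mean and standard deviation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the nearest-centroid classifier is a simple stand-in rather than any of the study's four models, and the function names (`make_data`, `holdout_accuracy`, `kfold_accuracies`) are hypothetical. It does not reproduce the study's feature selection or its reported figures.

```python
import random
import statistics

def make_data(n=400, n_features=5, seed=0):
    """Synthetic two-class data (assumption): class 1 is shifted by +2.0 per feature."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        X.append([rng.gauss(2.0 * label, 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def accuracy(train_idx, test_idx, X, y):
    """Fit on the training indices and return the fraction of correct test predictions."""
    model = fit_centroids([X[i] for i in train_idx], [y[i] for i in train_idx])
    hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
    return hits / len(test_idx)

def holdout_accuracy(X, y, train_fraction, seed=0):
    """Single shuffled train/test split at the given training fraction."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(idx)))
    return accuracy(idx[:cut], idx[cut:], X, y)

def kfold_accuracies(X, y, k=10, seed=0):
    """k-fold cross-validation: each fold serves once as the test set."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(accuracy(train_idx, folds[i], X, y))
    return scores

X, y = make_data()
holdout = {r: holdout_accuracy(X, y, r) for r in (0.6, 0.7, 0.8, 0.9)}
cv_scores = kfold_accuracies(X, y)
cv_mean, cv_sd = statistics.mean(cv_scores), statistics.stdev(cv_scores)

for ratio, acc in holdout.items():
    print(f"train fraction {ratio:.1f}: accuracy {acc:.4f}")
print(f"10-fold CV accuracy: {cv_mean:.4f} +/- {cv_sd:.4f}")
```

Comparing the four hold-out accuracies with the cross-validated mean and standard deviation is one way to check whether the choice of split ratio matters, which is the kind of comparison the abstract reports.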