Mahmood Nozad Hussein, Kadir Dler Hussein
Department of Statistics and information, College of Administration and Economics, Salahaddin University-Erbil, Erbil, Iraq; Cihan University Sulaimaniya Research Center (CUSRC), Cihan University Sulaimaniya, Sulaymaniyah City, Kurdistan Region, Iraq.
Department of Statistics and information, College of Administration and Economics, Salahaddin University-Erbil, Erbil, Iraq.
Leuk Res. 2025 Mar;150:107663. doi: 10.1016/j.leukres.2025.107663. Epub 2025 Feb 11.
This study investigated the application of sparsity regularization methods to improve the classification of leukemia subtypes from high-dimensional gene expression data. Multinomial logistic regression models with Ridge, Lasso, and Elastic Net regularization were employed to address overfitting and high dimensionality while enhancing model interpretability. The study used a leukemia dataset from the Curated Microarray Database (CuMiDa) containing expression data for 16,383 genes across 281 samples representing seven types of leukemia. Accuracy, the Kappa statistic, AUC, and F1-score were used to measure model performance. In addition, the effectiveness of each method for gene selection and dimensionality reduction was discussed. Elastic Net regularization outperformed Ridge and Lasso in overall classification performance, achieving the highest accuracy and Kappa values. Lasso and Elastic Net, in turn, performed more effective feature selection than Ridge, producing sparse models that could efficiently discriminate leukemia subtypes. These results highlight that sparsity regularization can improve both understanding and accuracy in the challenging task of leukemia subtype classification, which may enable more tailored treatments.
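The contrast between the three penalties can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the CuMiDa data. It assumes the common elastic-net parametrization, penalty = lam * (alpha * ||W||_1 + (1 - alpha) / 2 * ||W||_2^2), where alpha = 0 gives Ridge and alpha = 1 gives Lasso, and it fits the model by proximal gradient descent on a small synthetic dataset. All names (`fit`, `predict`, `lam`, `alpha`) and the toy data are hypothetical choices made for illustration.

```python
import math
import random

def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_threshold(value, cutoff):
    # Proximal operator of the L1 penalty: shrinks toward zero, exactly zero inside the cutoff.
    if value > cutoff:
        return value - cutoff
    if value < -cutoff:
        return value + cutoff
    return 0.0

def class_scores(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def fit(X, y, n_classes, lam, alpha, lr=0.1, n_iter=200):
    # Assumed penalty: lam * (alpha * ||W||_1 + (1 - alpha) / 2 * ||W||_2^2); intercepts unpenalized.
    n, p = len(X), len(X[0])
    W = [[0.0] * p for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(n_iter):
        grad_W = [[0.0] * p for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = softmax(class_scores(W, b, x))
            for k in range(n_classes):
                err = (probs[k] - (1.0 if k == label else 0.0)) / n
                grad_b[k] += err
                for j in range(p):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k]
            for j in range(p):
                # Gradient step on the smooth part (loss + ridge term), then the L1 proximal step.
                smooth = grad_W[k][j] + lam * (1.0 - alpha) * W[k][j]
                W[k][j] = soft_threshold(W[k][j] - lr * smooth, lr * lam * alpha)
    return W, b

def predict(W, b, x):
    scores = class_scores(W, b, x)
    return scores.index(max(scores))

# Synthetic stand-in data: 3 classes, 12 features, only the first two are informative.
random.seed(0)
centers = [(3.0, 0.0), (0.0, 3.0), (-3.0, -3.0)]
n_samples, n_features = 60, 12
y = [i % 3 for i in range(n_samples)]
X = []
for label in y:
    row = [random.gauss(0.0, 1.0) for _ in range(n_features)]
    row[0] += centers[label][0]
    row[1] += centers[label][1]
    X.append(row)

zero_counts, accuracies = {}, {}
for name, alpha in [("ridge", 0.0), ("elastic_net", 0.5), ("lasso", 1.0)]:
    W, b = fit(X, y, n_classes=3, lam=0.1, alpha=alpha)
    zero_counts[name] = sum(w == 0.0 for row in W for w in row)
    accuracies[name] = sum(predict(W, b, x) == t for x, t in zip(X, y)) / n_samples
    print(name, "zero weights:", zero_counts[name], "training accuracy:", accuracies[name])
```

In this sketch the L1 term is what sets coefficients exactly to zero, so the Lasso and Elastic Net fits drop uninformative features while the Ridge fit only shrinks them. The accuracies printed here are training accuracies on toy data and say nothing about the results reported in the study.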