Saleem Muniba, Aslam Waqar, Lali Muhammad Ikram Ullah, Rauf Hafiz Tayyab, Nasr Emad Abouel
Department of Computer Science & Information Technology, The Government Sadiq College Women University Bahawalpur, Bahawalpur 63100, Pakistan.
Department of Information Security, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan.
Diagnostics (Basel). 2023 Nov 14;13(22):3441. doi: 10.3390/diagnostics13223441.
Thalassemia is one of the most common genetic disorders worldwide and is characterized by defects in hemoglobin synthesis. Affected individuals have a malfunction in one or more of the four globin genes, leading to chronic hemolytic anemia, an imbalance in the hemoglobin chain ratio, iron overload, and ineffective erythropoiesis. Despite the challenges posed by this condition, recent years have brought significant advances in diagnosis, therapy, and transfusion support that have markedly improved the prognosis for thalassemia patients. This research empirically evaluates the efficacy of models built with classification methods and explores the effectiveness of relevant features derived using various machine-learning techniques. Five feature selection approaches, namely Chi-Square (χ²), Exploratory Factor Score (EFS), tree-based Recursive Feature Elimination (RFE), gradient-based RFE, and Linear Regression Coefficient, were employed to determine the optimal feature set. Nine classifiers, namely K-Nearest Neighbors (KNN), Decision Trees (DT), Gradient Boosting Classifier (GBC), Linear Regression (LR), AdaBoost, Extreme Gradient Boosting (XGB), Random Forest (RF), Light Gradient Boosting Machine (LGBM), and Support Vector Machine (SVM), were used to evaluate performance. When paired with the LR classifier, the χ² method registered 91.56% precision, 91.04% recall, and a 92.65% F-score. Moreover, the results underscore that combining over-sampling via the Synthetic Minority Over-sampling Technique (SMOTE) with RFE and 10-fold cross-validation markedly improves the detection accuracy for αT patients. Notably, the GBC achieves 93.46% accuracy, 93.89% recall, and a 92.72% F1 score.
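As an illustration of two ideas named in the abstract, the following minimal sketch shows SMOTE-style over-sampling (a synthetic point is placed on the segment between a minority-class sample and one of its nearest minority-class neighbours) and the computation of precision, recall, and F1 from predictions. It is not the authors' implementation: the function names `smote_oversample` and `precision_recall_f1`, the parameter `k`, and all data values are hypothetical and chosen only for demonstration.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority-class neighbours (core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 (their harmonic mean) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up minority-class samples with two features each (not study data).
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
new_points = smote_oversample(minority, n_new=6)

# Made-up labels and predictions, used only to illustrate the metrics.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice, over-sampling is normally applied only to the training portion of each cross-validation fold, so that synthetic points never leak into the data used for evaluation.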