Gul Sheema, Muhammad Khan Dost, Aldahmani Saeed, Khan Zardad
Department of Statistics, Abdul Wali Khan University, Mardan, Pakistan.
Department of Statistics and Business Analytics, United Arab Emirates University, Al Ain, United Arab Emirates.
PLoS One. 2025 Jun 10;20(6):e0325147. doi: 10.1371/journal.pone.0325147. eCollection 2025.
High-dimensional gene expression data pose significant challenges for binary classification, particularly for feature selection. Conventional methods, for example the Proportional Overlap Score, the Wilcoxon Rank-Sum Test, the Weighted Signal to Noise Ratio, ensemble Minimum Redundancy and Maximum Relevance, the Fisher Score, and the Robust Weighted Score for unbalanced data, are affected by class imbalance and by redundancy among features. Addressing these issues requires feature selection methods tailored to class imbalance. This study proposes a more robust solution, the Margin Weighted Robust Discriminant Score (MW-RDS), for feature selection in high-dimensional imbalanced problems. MW-RDS integrates a minority amplification factor so that minority-class observations retain influence during the feature ranking process. The amplification factor, together with class-specific stability weights obtained from a minority-focused robust discriminant score, is used to maximize the differential capability of genes/features. The score is then weighted by margin weights extracted from support vectors to enhance the discriminative power of genes/features, highlighting their potential for class separation. Finally, the top-ranked genes/features are constrained using [Formula: see text]-regularization to discard redundant genes while retaining the most significant ones. The proposed method is evaluated on 9 openly accessible gene expression datasets using Random Forest, Support Vector Machine, and Weighted k Nearest Neighbors classifiers, in terms of accuracy, sensitivity, specificity, F1-score, and precision. The results show that the proposed method outperforms the existing methods in most cases. Boxplots and stability plots are also provided to give a deeper understanding of the results. To further assess the efficacy of the proposed method, the paper also presents a detailed simulation study.
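The five performance metrics named in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `binary_metrics`, the choice of the minority class as the positive label, and the toy labels are all assumptions made for illustration. It shows why accuracy alone can mislead on imbalanced data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute standard binary classification metrics from label lists.

    Hypothetical helper for illustration; `positive` is assumed to be
    the minority class label.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def safe_div(num, den):
        # Return 0.0 when a metric is undefined (zero denominator).
        return num / den if den else 0.0

    sensitivity = safe_div(tp, tp + fn)   # recall on the positive class
    specificity = safe_div(tn, tn + fp)   # recall on the negative class
    precision = safe_div(tp, tp + fp)
    f1 = safe_div(2 * precision * sensitivity, precision + sensitivity)
    accuracy = safe_div(tp + tn, len(pairs))
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Toy imbalanced example (made-up labels): 2 minority vs. 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
```

In this toy case accuracy is 0.8 while sensitivity is only 0.5, which is the kind of gap that motivates reporting sensitivity, specificity, precision, and F1-score alongside accuracy when classes are imbalanced.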