Department of Computer Sciences, Yusuf Maitama Sule University, 700222 Kofar Nassarawa, Kano, Nigeria.
School of Computer Sciences, Universiti Sains Malaysia, 11800 Gelugor, Malaysia.
Genes (Basel). 2020 Jun 27;11(7):717. doi: 10.3390/genes11070717.
Training a machine learning algorithm on an imbalanced data set is inherently challenging. It becomes more demanding when samples are limited but the number of features is massive (high dimensionality). High-dimensional, imbalanced data sets pose severe challenges in many real-world applications, such as the analysis of biomedical data. Numerous researchers have investigated either class imbalance or high dimensionality and proposed various methods. Nonetheless, few approaches reported in the literature address the intersection of the two problems, owing to their complicated interactions. Recently, feature selection has become a well-known technique for overcoming this problem by selecting discriminative features that represent both the minority and majority classes. This paper proposes a new method called Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA). rCBR-BGOA employs an ensemble of multi-filters coupled with the Correlation-Based Redundancy method to select optimal feature subsets. A Binary Grasshopper Optimization Algorithm (BGOA) is used to formulate feature selection as an optimization problem and to select the best (near-optimal) combination of features from the majority and minority classes. The results, supported by statistical analysis, indicate that rCBR-BGOA can improve classification performance on high-dimensional, imbalanced data sets in terms of the G-mean and Area Under the Curve (AUC) performance metrics.
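For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how they are commonly computed for a binary problem. It is not taken from the paper: the function names `g_mean` and `auc` are illustrative, the G-mean is assumed to be the geometric mean of sensitivity and specificity, and AUC is assumed to mean the area under the ROC curve, computed here through its rank interpretation (the probability that a randomly chosen minority sample scores higher than a randomly chosen majority sample).

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity (binary labels).

    Assumption: `positive` marks the minority class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def auc(y_true, scores, positive=1):
    """ROC AUC via pairwise comparison; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy imbalanced labels: 2 minority (1) and 8 majority (0) samples.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Predicting the majority class everywhere is 80% accurate,
# yet sensitivity is 0, so the G-mean collapses to 0.
always_majority = [0] * 10
print(g_mean(labels, always_majority))
```

The toy example illustrates why accuracy is a poor guide on imbalanced data: a classifier that ignores the minority class entirely still looks accurate, while the G-mean penalizes it fully.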