Tye Yi Wei, Chew XinYing, Yusof Umi Kalsom, Tulpar Samat
School of Computer Sciences, Universiti Sains Malaysia, Gelugor, Penang, Malaysia.
School of Computing and Informatics, Albukhary International University, Alor Setar, Kedah, Malaysia.
PLoS One. 2025 Sep 8;20(9):e0331089. doi: 10.1371/journal.pone.0331089. eCollection 2025.
Advances in data collection have resulted in an exponential growth of high-dimensional microarray datasets for binary classification in bioinformatics and medical diagnostics. These datasets generally possess many features but relatively few samples, resulting in challenges associated with the "curse of dimensionality", such as feature redundancy and an elevated risk of overfitting. While traditional feature selection approaches, such as filter-based and wrapper-based approaches, can help to reduce dimensionality, they often struggle to capture feature interactions while adequately preserving model generalization. Therefore, this paper introduces the Adaptive Cluster-Guided Simple, Fast, and Efficient (ACG-SFE) feature selection model, a hybrid approach designed to address the challenges of high-dimensional microarray data in binary classification. ACG-SFE enhances the Simple, Fast, and Efficient (SFE) evolutionary feature selection model by integrating hierarchical clustering to dynamically group correlated features, with the optimal number of clusters determined by the Silhouette index, the Davies-Bouldin score, and the feature-to-observation ratio. Within each cluster, it adaptively selects representative features using mutual information and adjusts the selection threshold through a progress factor. This hybrid filter-wrapper approach better captures feature interactions, effectively minimizing redundancy and overfitting while enhancing classification performance. The proposed model is assessed against four state-of-the-art evolutionary feature selection models on 11 high-dimensional microarray datasets. Experimental results indicate that ACG-SFE effectively selects a small yet pertinent feature subset, minimizing redundancy while attaining enhanced classification accuracy and F-measure. Furthermore, its lower RMSE between train and test accuracy substantiates its capability to reduce overfitting, outperforming the comparative models. These findings establish ACG-SFE as an effective feature selection model for handling high-dimensional microarray data in binary classification, enhancing classification accuracy while selecting minimal relevant features to reduce unnecessary complexity and the risk of overfitting.
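The cluster-guided filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only the idea of grouping correlated features hierarchically and keeping the member of each cluster with the highest mutual information with the class label. Several things here are assumptions made for illustration: the number of clusters `k` is passed in directly (the paper derives it from the Silhouette index, Davies-Bouldin score, and feature-to-observation ratio), features are binarized at their median before mutual information is computed, average linkage on the distance 1 - |r| is used, and all function names are hypothetical. The SFE evolutionary search and the progress factor are omitted.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def cluster_features(columns, k):
    """Average-linkage agglomerative clustering of features, distance = 1 - |r|."""
    m = len(columns)
    dist = {(i, j): 1.0 - abs(pearson(columns[i], columns[j]))
            for i, j in combinations(range(m), 2)}
    clusters = [[i] for i in range(m)]

    def linkage(c1, c2):
        total = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    # Repeatedly merge the two closest clusters until only k remain.
    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def binarize(col):
    """Split at the median: a crude discretization assumed for this sketch."""
    med = sorted(col)[len(col) // 2]
    return [int(v >= med) for v in col]


def select_representatives(columns, labels, k):
    """Keep one feature index per cluster: the member with the highest MI with the labels."""
    reps = [max(cluster,
                key=lambda j: mutual_information(binarize(columns[j]), labels))
            for cluster in cluster_features(columns, k)]
    return sorted(reps)


# Toy data (made up): features 0 and 1 are nearly collinear, as are 2 and 3.
columns = [
    [1, 2, 3, 7, 8, 9],
    [2, 4, 6, 14, 16, 19],
    [5, 1, 4, 2, 6, 3],
    [5.1, 1.2, 3.9, 2.2, 6.1, 2.8],
]
labels = [0, 0, 0, 1, 1, 1]
print(select_representatives(columns, labels, k=2))
```

On the toy data, the two correlated pairs end up in separate clusters and one feature survives from each, which is the redundancy-reduction effect the abstract describes. For real microarray data, a library implementation of hierarchical clustering and a proper mutual information estimator would be preferable to these pure-Python helpers.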