Department of Physics, Ryerson University, Toronto, ON, Canada.
Department of Computer Science, University of Regina, Regina, Canada.
Comput Biol Med. 2023 Sep;164:107309. doi: 10.1016/j.compbiomed.2023.107309. Epub 2023 Jul 31.
Gene selection, a high-dimensional problem, has drawn considerable attention in machine learning and computational biology over the past decade. For gene selection in cancer datasets, feature selection techniques have been developed that differ in strategy (filter, wrapper, and embedded) and in their use of label information (supervised, unsupervised, and semi-supervised). However, hybrid feature selection can still improve performance. In this paper, we propose a hybrid feature selection method based on filter and wrapper strategies. In the filter phase, we develop an unsupervised feature selection method based on non-convex regularized non-negative matrix factorization and structure learning, which we call NCNMFSL. In the wrapper phase, mushroom reproduction optimization (MRO) is used for the first time to obtain the most informative feature subset. In this hybrid method, irrelevant features are filtered out by NCNMFSL, and the most discriminative features are then selected by MRO. To show the effectiveness of the proposed method, numerical experiments are conducted on the Breast, Heart, Colon, Leukemia, Prostate, Tox-171, and GLI-85 benchmark datasets. SVM and decision tree classifiers are used to evaluate the proposed technique, and the top accuracies are 0.97, 0.84, 0.98, 0.95, 0.98, 0.87, and 0.85 for Breast, Heart, Colon, Leukemia, Prostate, Tox-171, and GLI-85, respectively. The computational results show the effectiveness of the proposed method in comparison with state-of-the-art feature selection techniques.
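The two-phase structure described above (an unsupervised filter that shrinks the feature pool, followed by a wrapper that searches the survivors using a classifier) can be sketched as follows. This is only an illustrative skeleton, not the authors' method: a variance ranking stands in for NCNMFSL, a seeded random subset search stands in for MRO, and a leave-one-out nearest-centroid classifier stands in for the SVM and decision tree evaluators. All function names and parameters here are hypothetical.

```python
import random
from statistics import pvariance


def filter_by_variance(X, keep):
    """Filter phase (stand-in for NCNMFSL): unsupervised variance ranking.

    Returns the sorted indices of the `keep` highest-variance features.
    """
    n_features = len(X[0])
    scores = [pvariance([row[j] for row in X]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier on `features`."""
    correct = 0
    for i in range(len(X)):
        # Build class centroids from every sample except the held-out one.
        centroids = {}
        for label in set(y):
            rows = [X[k] for k in range(len(X)) if k != i and y[k] == label]
            if rows:
                centroids[label] = [
                    sum(r[j] for r in rows) / len(rows) for j in features
                ]
        point = [X[i][j] for j in features]
        predicted = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        correct += predicted == y[i]
    return correct / len(X)


def wrapper_random_search(X, y, candidates, subset_size, iterations=200, seed=0):
    """Wrapper phase (stand-in for MRO): seeded random search over subsets."""
    rng = random.Random(seed)
    best_subset, best_score = None, -1.0
    for _ in range(iterations):
        subset = sorted(rng.sample(candidates, subset_size))
        score = loo_accuracy(X, y, subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


def hybrid_select(X, y, keep, subset_size, iterations=200, seed=0):
    """Filter first, then run the wrapper only on the surviving features."""
    candidates = filter_by_variance(X, keep)
    return wrapper_random_search(X, y, candidates, subset_size, iterations, seed)
```

Calling `hybrid_select(X, y, keep, subset_size)` with `X` as a list of sample rows and `y` as a list of class labels returns the chosen feature indices and their leave-one-out accuracy. The point of the split is cost: the wrapper evaluates a classifier for every candidate subset, so running the cheap label-free filter first keeps that search space small.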