Wang Yadi, Li Xiaoping, Ruiz Ruben
IEEE Trans Cybern. 2023 Feb;53(2):707-717. doi: 10.1109/TCYB.2021.3139898. Epub 2023 Jan 13.
Feature selection (FS) for classification is crucial when machine learning is applied to large-scale image and bio-microarray data. Selecting informative features from high-dimensional data is challenging because such data generally contain many irrelevant and redundant features, which often degrade classifier performance and mislead classification tasks. In this article, we present an efficient FS algorithm that improves classification accuracy by taking into account both the relevance of each feature and the pairwise correlation between features with respect to the class labels. Based on conditional mutual information and entropy, a new supervised similarity measure is proposed. This measure is used to evaluate feature redundancy (to be minimized) and is then combined with an evaluation of feature relevance (to be maximized). A new criterion, max-relevance and min-supervised-redundancy (MRMSR), is introduced and theoretically proved for FS. The proposed MRMSR-based method is compared with seven existing FS approaches on several frequently studied public benchmark datasets. Experimental results demonstrate that the proposed method is more effective at selecting informative features and achieves competitive or better classification performance.
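The abstract does not give the exact form of the supervised similarity measure or of the MRMSR criterion, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes discrete features, uses a hypothetical similarity (conditional mutual information between two features given the class, normalized by their joint conditional entropy), and combines it with mutual-information relevance in a generic greedy, mRMR-style search. All function names, the scoring rule `relevance - mean similarity`, and the toy dataset are illustrative assumptions.

```python
from itertools import product

import numpy as np


def entropy(*cols):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    joint = np.column_stack(cols)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def mutual_info(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cond_mutual_info(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def supervised_similarity(x, y, c):
    """Hypothetical similarity in [0, 1]: I(X; Y | C) / H(X, Y | C).

    This normalization is an assumption for illustration; the paper's
    actual measure is not specified in the abstract.
    """
    joint_cond = entropy(x, y, c) - entropy(c)  # H(X, Y | C)
    if joint_cond <= 1e-12:
        # Both features are fully determined by the class; treated as 0 here.
        return 0.0
    return cond_mutual_info(x, y, c) / joint_cond


def select_features(X, c, k):
    """Greedy selection: high relevance I(X_j; C), low mean similarity
    to the already selected features (assumed scoring rule)."""
    n_features = X.shape[1]
    relevance = [mutual_info(X[:, j], c) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, n_features):
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean(
                [supervised_similarity(X[:, j], X[:, s], c) for s in selected]
            )
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


def toy_data():
    """Illustrative data: f0 is a noisy copy of the class, f1 duplicates f0,
    f2 is an independently noisy copy of the class, f3 is unrelated."""
    rows, labels = [], []
    for c, e0, e2, f3 in product((0, 1), range(4), range(4), (0, 1)):
        f0 = c ^ int(e0 == 0)  # flipped in 1 of 4 cases
        f2 = c ^ int(e2 == 0)  # flipped independently in 1 of 4 cases
        rows.append((f0, f0, f2, f3))
        labels.append(c)
    return np.array(rows), np.array(labels)


if __name__ == "__main__":
    X, c = toy_data()
    print(select_features(X, c, 2))
```

On the toy data, the duplicated feature is penalized by a similarity close to 1 to the feature already chosen, while the independently noisy feature has near-zero conditional mutual information with it, so the sketch prefers the latter as its second pick.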