Fan Mingyu, Zhang Xiaoqin, Hu Jie, Gu Nannan, Tao Dacheng
IEEE Trans Neural Netw Learn Syst. 2022 Oct;33(10):5859-5872. doi: 10.1109/TNNLS.2021.3071603. Epub 2022 Oct 5.
Feature selection (FS), which aims to identify the most informative subset of input features, is an important approach to dimensionality reduction. In this article, a novel FS framework is proposed for both unsupervised and semisupervised scenarios. To make efficient use of the data distribution when evaluating features, the framework combines data structure learning (also referred to as data distribution modeling) and FS in a unified formulation, so that data structure learning improves the results of FS and vice versa. Two types of data structures, soft and hard, are learned and used in the proposed framework. The soft data structure refers to the pairwise weights among data samples, and the hard data structure refers to the estimated labels obtained from clustering or semisupervised classification. Both data structures are formulated as regularization terms in the proposed framework. In the optimization process, the soft and hard data structures are learned from the data as represented by the currently selected features, and the most informative features are then reselected with reference to those data structures. In this way, the framework uses the interaction between data structure learning and FS to select the most discriminative and informative features. Following the proposed framework, a new semisupervised FS (SSFS) method is derived and studied in depth. Experiments on real-world data sets demonstrate the effectiveness of the proposed method.
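The alternating scheme described in the abstract (learn the soft and hard data structures from the selected features, then reselect features) can be illustrated with a minimal NumPy sketch. This is not the authors' SSFS formulation: the Gaussian pairwise weights, the label-propagation step, the ridge-regression feature scores, and all function and parameter names (`soft_structure`, `hard_structure`, `score_features`, `alternate_fs`, `alpha`, `lam`) are assumptions chosen only to make the loop concrete. Unlabeled samples are assumed to be marked with `-1`.

```python
import numpy as np


def soft_structure(Xs):
    """Soft structure (assumed form): Gaussian pairwise weights between samples."""
    sq = np.sum(Xs ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Xs @ Xs.T, 0.0)
    # Use the median squared distance as bandwidth (a heuristic, not from the paper).
    scale = np.median(d2[d2 > 0]) if np.any(d2 > 0) else 1.0
    S = np.exp(-d2 / scale)
    np.fill_diagonal(S, 0.0)
    return S


def hard_structure(S, y, n_classes, alpha=0.9, n_iter=50):
    """Hard structure (assumed form): labels estimated by label propagation on S."""
    n = len(y)
    labeled = y >= 0
    Y0 = np.zeros((n, n_classes))
    Y0[labeled, y[labeled]] = 1.0
    deg = S.sum(axis=1) + 1e-12
    P = S / np.sqrt(np.outer(deg, deg))  # symmetrically normalized weights
    F = Y0.copy()
    for _ in range(n_iter):
        F = alpha * P @ F + (1.0 - alpha) * Y0
    est = F.argmax(axis=1)
    est[labeled] = y[labeled]  # keep the known labels fixed
    return est


def score_features(X, est, n_classes, lam=1.0):
    """Score features by the row norms of a ridge-regression weight matrix."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    Y = np.eye(n_classes)[est]
    Yc = Y - Y.mean(axis=0)
    d = X.shape[1]
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ Yc)
    return np.linalg.norm(W, axis=1)


def alternate_fs(X, y, n_classes, k, n_rounds=5):
    """Alternate between structure learning and feature reselection."""
    selected = np.arange(X.shape[1])  # start from all features
    for _ in range(n_rounds):
        S = soft_structure(X[:, selected])       # structures from selected features
        est = hard_structure(S, y, n_classes)
        scores = score_features(X, est, n_classes)
        selected = np.sort(np.argsort(scores)[-k:])  # reselect the top-k features
    return selected, est
```

In this sketch the soft structure influences feature scoring only indirectly, through the propagated labels, whereas the abstract states that both structures enter the objective as regularization terms. Calling `alternate_fs(X, y, n_classes, k)` returns the indices of the selected features together with the estimated labels.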