IEEE Trans Cybern. 2023 Jul;53(7):4232-4244. doi: 10.1109/TCYB.2022.3160244. Epub 2023 Jun 15.
In many real-world unsupervised learning applications the data are balanced, that is, each class contains approximately the same number of instances, and we often need to construct a model that reveals this balance. However, in many datasets, especially high-dimensional ones, the data in the original feature space do not exhibit such balance because of redundant and noisy features. To tackle this problem, we apply unsupervised spectral feature selection to select informative features that better reveal the balanced structure of the data. Although spectral feature selection is one of the most popular unsupervised feature selection approaches and has been widely studied, none of the existing spectral feature selection methods considers the balance property of the data. To address this issue, in this article we propose a novel balanced spectral feature selection (BSFS) method, which selects features that are both discriminative and able to reveal the balanced structure of the data. To the best of our knowledge, this is the first spectral feature selection method that considers the balanced structure of data. By introducing a balanced regularization term, we seamlessly integrate balanced spectral clustering and feature selection into a unified framework. Finally, experiments on benchmark datasets show that the proposed method outperforms conventional feature selection methods in both clustering performance and balance, demonstrating its effectiveness and efficiency.
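As an illustration of what "balance" of a clustering can mean in practice, the following minimal sketch scores a set of cluster labels by the normalized entropy of the cluster sizes. This is an assumption made for illustration only: the abstract does not state which balance measure or regularization term BSFS uses, and the function name `balance_score` is hypothetical.

```python
import math
from collections import Counter


def balance_score(labels):
    """Normalized entropy of cluster sizes (illustrative measure only,
    not necessarily the one used by BSFS).

    Returns a value in [0, 1]: values near 1 mean the clusters have
    nearly equal sizes; 0 means there is at most one cluster.
    """
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        # Zero or one cluster: balance across clusters is undefined,
        # so this sketch reports 0.0 by convention.
        return 0.0
    n = len(labels)
    # Shannon entropy of the empirical cluster-size distribution.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Divide by the maximum possible entropy for k clusters.
    return entropy / math.log(k)


# Toy label assignments (made-up data for demonstration).
balanced = [0, 0, 0, 1, 1, 1, 2, 2, 2]
skewed = [0, 0, 0, 0, 0, 0, 0, 1, 2]

print(balance_score(balanced))  # close to 1: equal-sized clusters
print(balance_score(skewed))    # noticeably lower: one dominant cluster
```

A score of this kind could be reported alongside a clustering accuracy metric when comparing the clusterings obtained from different selected feature subsets.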