Yale University, United States of America.
Neural Netw. 2022 Aug;152:34-43. doi: 10.1016/j.neunet.2022.04.002. Epub 2022 Apr 12.
Modern datasets often contain large subsets of correlated features and nuisance features, which are unrelated or only loosely related to the main underlying structures of the data. Nuisance features can be identified using the Laplacian score criterion, which evaluates the importance of a given feature via its consistency with the graph Laplacian's leading eigenvectors. We demonstrate that in the presence of large numbers of nuisance features, the Laplacian must be computed on the subset of selected features rather than on the complete feature set. To this end, we propose a fully differentiable approach for unsupervised feature selection that uses the Laplacian score criterion to avoid selecting nuisance features. We employ an autoencoder architecture to cope with correlated features, trained to reconstruct the data from the subset of selected features. We build on the recently proposed concrete layer, which controls the number of selected features through the architecture itself and thereby simplifies the optimization process. Experiments on several real-world datasets demonstrate that our approach outperforms similar methods designed to avoid only correlated features or only nuisance features, but not both. Several state-of-the-art clustering results are reported. Our code is publicly available at https://github.com/jsvir/lscae.
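To make the Laplacian score criterion concrete, below is a minimal NumPy sketch of the classic formulation (kNN graph with heat-kernel weights; a lower score means the feature varies smoothly over the graph and is therefore more informative). This is only an illustration of the criterion itself, not the paper's method: the paper's contribution is to compute the graph on the *selected* feature subset inside a differentiable pipeline, which this sketch does not do. The function name and parameters are our own.

```python
import numpy as np

def laplacian_scores(X, k=5, t=1.0):
    """Laplacian score per column (feature) of X (n_samples x n_features).

    Lower score => the feature is more consistent with the local
    geometry of the kNN graph, i.e. less likely a nuisance feature.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # kNN adjacency with heat-kernel weights exp(-||xi - xj||^2 / t).
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # skip self (distance 0)
        W[i, nbrs] = np.exp(-d2[i, nbrs] / t)
    W = np.maximum(W, W.T)                 # symmetrize the graph
    D = W.sum(axis=1)                      # degrees
    L = np.diag(D) - W                     # unnormalized graph Laplacian
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        f = X[:, r]
        f = f - (f @ D) / D.sum()          # remove degree-weighted mean
        denom = f @ (D * f)
        scores[r] = (f @ L @ f) / denom if denom > 1e-12 else np.inf
    return scores
```

For example, a feature that separates two clusters receives a lower score than a pure-noise feature, since it is nearly constant along graph edges.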