Deep unsupervised feature selection by discarding nuisance and correlated features.

Affiliation

Yale University, United States of America.

Publication information

Neural Netw. 2022 Aug;152:34-43. doi: 10.1016/j.neunet.2022.04.002. Epub 2022 Apr 12.

DOI: 10.1016/j.neunet.2022.04.002

PMID: 35500458

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9526895/

Abstract

Modern datasets often contain large subsets of correlated features and nuisance features, which are unrelated or only loosely related to the main underlying structures of the data. Nuisance features can be identified using the Laplacian score criterion, which evaluates the importance of a given feature via its consistency with the graph Laplacian's leading eigenvectors. We demonstrate that in the presence of large numbers of nuisance features, the Laplacian must be computed on the subset of selected features rather than on the complete feature set. To do this, we propose a fully differentiable approach for unsupervised feature selection, utilizing the Laplacian score criterion to avoid the selection of nuisance features. We employ an autoencoder architecture to cope with correlated features, trained to reconstruct the data from the subset of selected features. We build on the recently proposed concrete layer, which allows controlling the number of selected features via architectural design and thereby simplifies the optimization process. Experimenting on several real-world datasets, we demonstrate that our proposed approach outperforms similar approaches designed to avoid only correlated or nuisance features, but not both. We report several state-of-the-art clustering results. Our code is publicly available at https://github.com/jsvir/lscae.
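
The claim that the Laplacian should be computed on the selected features rather than on the complete feature set can be illustrated with a small toy sketch. This is not the authors' implementation (their code is at the repository linked above). It assumes the classic Laplacian score, in which a lower value means the feature varies smoothly over a nearest-neighbour graph of the samples. The helper names `knn_graph` and `laplacian_score`, the synthetic two-cluster data, and the choice of k = 5 neighbours are all assumptions made for illustration.

```python
import random


def knn_graph(rows, k=5):
    """Symmetric binary k-nearest-neighbour affinity matrix (Euclidean)."""
    n = len(rows)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dist2 = [
            (sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])), j)
            for j in range(n) if j != i
        ]
        # Connect sample i to its k closest samples, in both directions.
        for _, j in sorted(dist2)[:k]:
            W[i][j] = W[j][i] = 1.0
    return W


def laplacian_score(f, W):
    """Classic Laplacian score of feature f on graph W (lower = smoother)."""
    n = len(f)
    deg = [sum(row) for row in W]
    # Remove the degree-weighted mean so the constant component is ignored.
    mean = sum(fi * di for fi, di in zip(f, deg)) / sum(deg)
    g = [fi - mean for fi in f]
    # g^T L g with L = D - W equals the sum over edges of W_ij (g_i - g_j)^2.
    num = sum(W[i][j] * (g[i] - g[j]) ** 2
              for i in range(n) for j in range(i + 1, n))
    den = sum(gi * gi * di for gi, di in zip(g, deg))
    return num / den


rng = random.Random(0)
n_per_cluster, n_nuisance = 20, 50  # hypothetical toy sizes

# One informative feature separating two clusters, plus pure-noise features.
informative = [c + rng.gauss(0, 0.1)
               for c in (-1.0, 1.0) for _ in range(n_per_cluster)]
rows_all = [[x] + [rng.gauss(0, 1) for _ in range(n_nuisance)]
            for x in informative]
rows_selected = [[x] for x in informative]

# Score the same informative feature on two different graphs.
score_full = laplacian_score(informative, knn_graph(rows_all))
score_selected = laplacian_score(informative, knn_graph(rows_selected))
print(f"graph built on all features:      {score_full:.3f}")
print(f"graph built on selected features: {score_selected:.3f}")
```

In this toy setting the graph built from all 51 features is dominated by noise, so the informative feature no longer looks smooth on it, whereas the graph built from the selected feature alone recovers the cluster structure and gives that feature a much lower score. The abstract's differentiable method couples selection and graph construction during training; this sketch only contrasts the two fixed graphs.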

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8966/9526895/00b06f05ad91/nihms-1835690-f0001.jpg
