Witten Daniela M, Tibshirani Robert
J Am Stat Assoc. 2010 Jun 1;105(490):713-726. doi: 10.1198/jasa.2010.tm09415.
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated data and on genomic data sets.
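The abstract describes selecting features and clusters with a single criterion, alternating between clustering and a lasso-type (L1) update of nonnegative feature weights. Below is a minimal sketch of sparse K-means in that spirit, not the authors' implementation (their reference implementation is the R package `sparcl`). The function names (`sparse_kmeans`, `_update_weights`, `_bcss`), the use of scikit-learn's `KMeans`, and the specific convergence tolerance are assumptions made for illustration; `s` denotes the L1 bound on the weight vector, typically between 1 and sqrt(p).

```python
# Minimal sketch of sparse K-means in the spirit of Witten & Tibshirani (2010).
# Hypothetical helpers; assumes s (the L1 bound on the weights) lies in [1, sqrt(p)].
import numpy as np
from sklearn.cluster import KMeans

def _bcss(X, labels):
    """Per-feature between-cluster sum of squares: total SS minus within-cluster SS."""
    total = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.zeros(X.shape[1])
    for k in np.unique(labels):
        Xk = X[labels == k]
        within += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return total - within

def _soft_threshold(a, delta):
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def _update_weights(a, s):
    """Maximize w'a subject to ||w||_2 <= 1, ||w||_1 <= s, w >= 0.

    Soft-threshold the per-feature criterion a, renormalize, and binary-search
    the threshold delta until the L1 constraint is met.
    """
    w = a / np.linalg.norm(a)
    if np.abs(w).sum() <= s:
        return w
    lo, hi = 0.0, np.max(np.abs(a))
    for _ in range(50):
        delta = (lo + hi) / 2.0
        w = _soft_threshold(a, delta)
        w = w / np.linalg.norm(w)
        if np.abs(w).sum() > s:
            lo = delta
        else:
            hi = delta
    return w

def sparse_kmeans(X, K, s, n_iter=20, random_state=0):
    n, p = X.shape
    w = np.full(p, 1.0 / np.sqrt(p))  # start from equal weights, ||w||_2 = 1
    for _ in range(n_iter):
        # (1) With w fixed, cluster on features scaled by sqrt(w_j); K-means on the
        #     scaled data maximizes the weighted between-cluster sum of squares.
        labels = KMeans(n_clusters=K, n_init=10, random_state=random_state).fit_predict(
            X * np.sqrt(w)
        )
        # (2) With clusters fixed, update w by soft-thresholding the per-feature BCSS,
        #     which zeroes out features that do not separate the clusters.
        w_new = _update_weights(_bcss(X, labels), s)
        if np.abs(w_new - w).sum() / np.abs(w).sum() < 1e-4:
            w = w_new
            break
        w = w_new
    return labels, w  # features with w_j == 0 are excluded from the clustering
```

In this sketch the tuning parameter `s` plays the role of the lasso-type penalty in the abstract: smaller values of `s` drive more feature weights to exactly zero, yielding clusters defined by an adaptively chosen subset of the features.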