How many clusters? An information-theoretic perspective.

Authors

Still Susanne, Bialek William

Affiliation

Department of Physics and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA.

Publication details

Neural Comput. 2004 Dec;16(12):2483-506. doi: 10.1162/0899766042321751.

DOI: 10.1162/0899766042321751
PMID: 15516271

Abstract

Clustering provides a common means of identifying structure in complex data, and there is renewed interest in clustering as a tool for the analysis of large data sets in many fields. A natural question is how many clusters are appropriate for the description of a given system. Traditional approaches to this problem are based on either a framework in which clusters of a particular shape are assumed as a model of the system or on a two-step procedure in which a clustering criterion determines the optimal assignments for a given number of clusters and a separate criterion measures the goodness of the classification to determine the number of clusters. In a statistical mechanics approach, clustering can be seen as a trade-off between energy- and entropy-like terms, with lower temperature driving the proliferation of clusters to provide a more detailed description of the data. For finite data sets, we expect that there is a limit to the meaningful structure that can be resolved and therefore a minimum temperature beyond which we will capture sampling noise. This suggests that correcting the clustering criterion for the bias that arises due to sampling errors will allow us to find a clustering solution at a temperature that is optimal in the sense that we capture maximal meaningful structure--without having to define an external criterion for the goodness or stability of the clustering. We show that in a general information-theoretic framework, the finite size of a data set determines an optimal temperature, and we introduce a method for finding the maximal number of clusters that can be resolved from the data in the hard clustering limit.
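The sampling bias mentioned in the abstract can be illustrated with a small sketch. It is not the authors' method: it only shows that a plug-in estimate of mutual information, computed from a finite sample of two independent variables, comes out positive even though the true value is zero, and that the spurious value shrinks as the sample grows. The function names, the alphabet size `k`, the sample sizes, and the random seed are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def plugin_mutual_information(pairs):
    """Plug-in (maximum-likelihood) estimate of I(X;Y) in bits from (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), with probabilities taken as counts / n
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def sample_independent(n, k, rng):
    """Draw n pairs of independent, uniformly distributed symbols from {0, ..., k-1}."""
    return [(rng.randrange(k), rng.randrange(k)) for _ in range(n)]


rng = random.Random(0)  # fixed seed (illustrative choice) for reproducibility
k = 4                   # hypothetical alphabet size for both variables

# True mutual information is 0 bits in both cases; only the sample size differs.
mi_small = plugin_mutual_information(sample_independent(50, k, rng))
mi_large = plugin_mutual_information(sample_independent(50_000, k, rng))

print(f"N = 50:     estimated I(X;Y) = {mi_small:.4f} bits")
print(f"N = 50000:  estimated I(X;Y) = {mi_large:.6f} bits")
```

Any criterion built on such estimates will reward structure that is only sampling noise unless it is corrected, which is the motivation the abstract gives for a bias-corrected clustering criterion.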

Similar articles

1. How many clusters? An information-theoretic perspective.
Neural Comput. 2004 Dec;16(12):2483-506. doi: 10.1162/0899766042321751.
2. Visual MRI: merging information visualization and non-parametric clustering techniques for MRI dataset analysis.
Artif Intell Med. 2008 Nov;44(3):183-99. doi: 10.1016/j.artmed.2008.06.006. Epub 2008 Sep 4.
3. Modified fuzzy gap statistic for estimating preferable number of clusters in fuzzy k-means clustering.
J Biosci Bioeng. 2008 Mar;105(3):273-81. doi: 10.1263/jbb.105.273.
4. Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses.
Artif Intell Med. 2006 Jun;37(2):85-109. doi: 10.1016/j.artmed.2006.03.005. Epub 2006 May 23.
5. K-means clustering versus validation measures: a data-distribution perspective.
IEEE Trans Syst Man Cybern B Cybern. 2009 Apr;39(2):318-31. doi: 10.1109/TSMCB.2008.2004559. Epub 2008 Dec 12.
6. Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach.
Bioinformatics. 2007 Jul 1;23(13):1607-15. doi: 10.1093/bioinformatics/btm158. Epub 2007 May 5.
7. Cumulative voting consensus method for partitions with variable number of clusters.
IEEE Trans Pattern Anal Mach Intell. 2008 Jan;30(1):160-73. doi: 10.1109/TPAMI.2007.1138.
8. Analysis of a Gibbs sampler method for model-based clustering of gene expression data.
Bioinformatics. 2008 Jan 15;24(2):176-83. doi: 10.1093/bioinformatics/btm562. Epub 2007 Nov 22.
9. Evaluating mixture modeling for clustering: recommendations and cautions.
Psychol Methods. 2011 Mar;16(1):63-79. doi: 10.1037/a0022673.
10. Clustering of change patterns using Fourier coefficients.
Bioinformatics. 2008 Jan 15;24(2):184-91. doi: 10.1093/bioinformatics/btm568. Epub 2007 Nov 19.

Cited by

1. What Do We Gain When Tolerating Loss? The Information Bottleneck Wrings Out Recombination.
Mol Biol Evol. 2025 Mar 5;42(3). doi: 10.1093/molbev/msaf029.
2. The Perception of Similarity, Difference and Opposition.
J Intell. 2023 Aug 24;11(9):172. doi: 10.3390/jintelligence11090172.
3. Starling: Introducing a mesoscopic scale with Confluence for Graph Clustering.
PLoS One. 2023 Aug 24;18(8):e0290090. doi: 10.1371/journal.pone.0290090. eCollection 2023.
4. Pareto-Optimal Clustering with the Primal Deterministic Information Bottleneck.
Entropy (Basel). 2022 May 30;24(6):771. doi: 10.3390/e24060771.
5. Hidden Hypergraphs, Error-Correcting Codes, and Critical Learning in Hopfield Networks.
Entropy (Basel). 2021 Nov 11;23(11):1494. doi: 10.3390/e23111494.
6. Bottleneck Problems: An Information and Estimation-Theoretic View.
Entropy (Basel). 2020 Nov 20;22(11):1325. doi: 10.3390/e22111325.
7. Estimating the Mutual Information between Two Discrete, Asymmetric Variables with Limited Samples.
Entropy (Basel). 2019 Jun 25;21(6):623. doi: 10.3390/e21060623.
8. Pattern Recognition Analysis Reveals Unique Contrast Sensitivity Isocontours Using Static Perimetry Thresholds Across the Visual Field.
Invest Ophthalmol Vis Sci. 2017 Sep 1;58(11):4863-4876. doi: 10.1167/iovs.17-22371.
9. Predictive modeling of EEG time series for evaluating surgery targets in epilepsy patients.
Hum Brain Mapp. 2017 May;38(5):2509-2531. doi: 10.1002/hbm.23537. Epub 2017 Feb 16.
10. Systems analysis of high-throughput data.
Adv Exp Med Biol. 2014;844:153-87. doi: 10.1007/978-1-4939-2095-2_8.