Information-based clustering.

Authors

Slonim Noam, Atwal Gurinder Singh, Tkacik Gasper, Bialek William

Affiliations

Joseph Henry Laboratories of Physics, and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA.

Publication information

Proc Natl Acad Sci U S A. 2005 Dec 20;102(51):18297-302. doi: 10.1073/pnas.0507432102. Epub 2005 Dec 13.

DOI:10.1073/pnas.0507432102

PMID:16352721

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC1317937/

Abstract

In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here, we reformulate the clustering problem from an information theoretic perspective that avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster "prototype," does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures nonlinear relations. We apply this approach to different domains and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures.
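As a rough illustration of two properties claimed in the abstract (capturing nonlinear relations and invariance to the representation of the data), here is a minimal sketch that assumes mutual information as the information-theoretic measure of relatedness; the abstract itself does not name the measure. The functions `mutual_information` and `pearson` and the toy data are hypothetical, chosen only for this example, and are not the authors' implementation.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    # Sum over observed pairs: p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Toy data (assumption): y is a deterministic but nonlinear function of x.
xs = [-2, -1, 0, 1, 2] * 20
ys = [x * x for x in xs]

# The same x values under an invertible change of representation.
relabeled = [math.exp(x) for x in xs]

print("Pearson correlation:", pearson(xs, ys))
print("Mutual information (bits):", mutual_information(xs, ys))
print("After relabeling x (bits):", mutual_information(relabeled, ys))
```

In this toy case the linear correlation vanishes even though y is fully determined by x, while the mutual information is positive and does not change when x is re-expressed through an invertible map.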
