A data-driven approach to estimating the number of clusters in hierarchical clustering.

Author information

Zambelli Antoine E

Affiliation

Quantech Solutions LLC, San Rafael, CA, USA.

Publication information

F1000Res. 2016 Dec 1;5. doi: 10.12688/f1000research.10103.1. eCollection 2016.

DOI: 10.12688/f1000research.10103.1

PMID: 28408972

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5373427/

Abstract

DNA microarray and gene expression problems often require researchers to cluster their data in order to better understand its structure. When the number of clusters is not known, one can resort to hierarchical clustering methods. However, very few automated algorithms currently exist for determining the true number of clusters in the data. We propose two new methods (mode and maximum difference) for estimating the number of clusters in a hierarchical clustering framework, creating a fully automated process with no human intervention. These methods are compared with the established elbow and gap statistic algorithms using simulated datasets and the Biobase Gene ExpressionSet. We also explore a data mixing procedure inspired by cross-validation techniques. We find that the overall performance of the maximum difference method is comparable to or greater than that of the gap statistic in multi-cluster scenarios, and that it achieves this performance at a fraction of the computational cost. This method also responds well to our mixing procedure, which opens the door to future research. We conclude that both the mode and maximum difference methods warrant further study of their mixing and cross-validation potential. We particularly recommend the maximum difference method in multi-cluster scenarios given its accuracy and execution times, and present it as an alternative to existing algorithms.
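The abstract names the maximum difference method without defining it. As a rough illustration only, the sketch below assumes it means cutting the dendrogram at the largest gap between consecutive merge heights. That reading, the function names `estimate_k_max_difference` and `make_toy_blobs`, the choice of average linkage, and the toy data are all assumptions made here, not details taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def estimate_k_max_difference(X, method="average"):
    """Estimate the cluster count from the largest gap between merge heights.

    Assumption: this reading of "maximum difference" is illustrative and is
    not taken from the paper. Requires at least 3 observations.
    """
    Z = linkage(X, method=method)   # (n - 1) merges, one per row
    heights = Z[:, 2]               # merge heights (non-decreasing for average linkage)
    gaps = np.diff(heights)         # difference between consecutive merge heights
    i = int(np.argmax(gaps))        # the largest gap follows merge i
    n = X.shape[0]
    return n - (i + 1)              # clusters remaining after merges 0..i


def make_toy_blobs(seed=0, per_cluster=30):
    """Three well-separated 2-D Gaussian blobs (hypothetical test data)."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
    return np.vstack(
        [rng.normal(c, 0.5, size=(per_cluster, 2)) for c in centers]
    )


if __name__ == "__main__":
    X = make_toy_blobs()
    k = estimate_k_max_difference(X)
    # Cut the same dendrogram into k flat clusters.
    labels = fcluster(linkage(X, method="average"), t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])
```

Because only the differences between merge heights are inspected, this sketch can never return a single cluster, which is at least consistent with the abstract restricting its recommendation to multi-cluster scenarios. It also needs just one pass over the linkage matrix, whereas the gap statistic requires re-clustering reference datasets.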

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8ab6/5373427/15f77aa997b7/f1000research-5-10884-g0000.jpg
