The next-generation K-means algorithm.

Author information

Demidenko Eugene

Affiliation

Department of Biomedical Data Science and Department of Mathematics, Dartmouth College, Hanover, New Hampshire.

Publication information

Stat Anal Data Min. 2018 Aug;11(4):153-166. doi: 10.1002/sam.11379. Epub 2018 May 11.

DOI:10.1002/sam.11379
PMID:30073045
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6062903/
Abstract

Typically, when referring to a model-based classification, the mixture distribution approach is understood. In contrast, we revive the hard-classification model-based approach developed by Banfield and Raftery (1993) for which K-means is equivalent to the maximum likelihood (ML) estimation. The next-generation K-means algorithm does not end after the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how to classify multilevel data? The statistical model-based approach for the K-means algorithm is the key, because it allows statistical simulations and studying the properties of classification following the track of the classical statistics. This paper illustrates the application of the ML classification to testing the no-clusters hypothesis, to studying various methods for selection of the number of clusters using simulations, robust clustering using Laplace distribution, studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K-means.
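
The equivalence between K-means and ML estimation mentioned in the abstract can be illustrated with a minimal sketch. It assumes the simplest hard-classification model, spherical Gaussian clusters with a common variance, under which maximizing the classification likelihood amounts to minimizing the within-cluster sum of squares. The function names (`kmeans`, `wss`, `classification_loglik`) and the simulated data are illustrative assumptions, not code from the paper.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations: assign each point to its nearest mean, then update means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct observations chosen at random (an assumed initialization).
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old mean if a cluster happens to become empty.
        new_means = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
            for j in range(k)
        ])
        if np.allclose(new_means, means):
            break
        means = new_means
    # Final assignment consistent with the returned means.
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1), means


def wss(X, labels, means):
    """Within-cluster sum of squares, the quantity K-means minimizes."""
    return float(((X - means[labels]) ** 2).sum())


def classification_loglik(X, labels, means):
    """Gaussian classification log-likelihood with the common variance at its ML value.

    Assumes covariance sigma^2 * I shared by all clusters, so sigma^2 = WSS / (n * p)
    and the log-likelihood is a decreasing function of WSS.
    """
    n, p = X.shape
    sigma2 = wss(X, labels, means) / (n * p)
    return -0.5 * n * p * (np.log(2.0 * np.pi * sigma2) + 1.0)


if __name__ == "__main__":
    # Simulated data: two well-separated spherical clusters (illustrative only).
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
    for k in (1, 2, 3):
        labels, means = kmeans(X, k)
        print(k, wss(X, labels, means), classification_loglik(X, labels, means))
```

Because the log-likelihood here depends on the partition only through the within-cluster sum of squares, it always increases with the number of clusters, so it cannot by itself decide how many clusters there are. That is one reason the questions listed in the abstract call for simulation-based study rather than a direct comparison of likelihoods.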
