Fast clustering using adaptive density peak detection.

Authors

Wang Xiao-Feng, Xu Yifan

Affiliations

1 Department of Quantitative Health Sciences/Biostatistics Section, Cleveland Clinic Lerner Research Institute, Cleveland, OH, USA.

2 Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA.

Publication information

Stat Methods Med Res. 2017 Dec;26(6):2800-2811. doi: 10.1177/0962280215609948. Epub 2015 Oct 16.

DOI:10.1177/0962280215609948

PMID:26475830

Abstract

Common limitations of clustering methods include slow algorithm convergence, instability caused by pre-specifying a number of intrinsic parameters, and a lack of robustness to outliers. A recent clustering approach proposed a fast search algorithm for cluster centers based on their local densities. However, the selection of the key intrinsic parameters in that algorithm was not systematically investigated. Estimating the "optimal" parameters is relatively difficult because the algorithm's original definition of local density is based on a truncated counting measure. In this paper, we propose a clustering procedure with adaptive density peak detection, in which the local density is estimated through nonparametric multivariate kernel estimation. The model parameter can then be calculated from equations with statistical theoretical justification. We also develop an automatic cluster centroid selection method that maximizes an average silhouette index. The advantages and flexibility of the proposed method are demonstrated through simulation studies and the analysis of a few benchmark gene expression data sets. The method runs in a single step without any iteration, so it is fast and has great potential for big data analysis. A user-friendly R package, ADPclust, has been developed for public use.
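The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the ADPclust implementation and does not reproduce the paper's parameter equations. All function names are hypothetical, the normal-reference rule-of-thumb bandwidth is a stand-in assumption for the paper's bandwidth calculation, and ranking candidate centers by the product of density and separation distance is likewise an assumption. The sketch estimates local density with a Gaussian kernel, assigns each point to the cluster of its nearest higher-density neighbor in one pass, and picks the number of centers that maximizes the average silhouette.

```python
import numpy as np

def rule_of_thumb_bandwidth(X):
    """Normal-reference bandwidth (assumption: stands in for the paper's equations)."""
    n, d = X.shape
    sigma = X.std(axis=0, ddof=1).mean()  # one scalar bandwidth for all dimensions
    return sigma * (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))

def density_peak_cluster(X, h, k):
    """Cluster the rows of X into k groups with a density-peak rule."""
    n = X.shape[0]
    # Pairwise Euclidean distances.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # Local density: Gaussian kernel estimate, up to a constant factor.
    rho = np.exp(-0.5 * (D / h) ** 2).sum(axis=1)
    order = np.argsort(-rho, kind="stable")  # highest density first
    delta = np.zeros(n)                      # distance to nearest denser point
    parent = np.full(n, -1)                  # index of that denser point
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = D[i].max()            # convention for the global peak
            continue
        higher = order[:rank]
        j = higher[np.argmin(D[i, higher])]
        delta[i], parent[i] = D[i, j], j
    # Assumed ranking rule: centers are the k points with the largest rho * delta.
    gamma = rho * delta
    gamma[order[0]] = np.inf                 # the global peak is always a center
    centers = np.argsort(-gamma, kind="stable")[:k]
    labels = np.full(n, -1)
    labels[centers] = np.arange(k)
    # Single pass in density order: inherit the label of the denser neighbor.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, D

def average_silhouette(D, labels):
    """Mean silhouette width computed from a distance matrix (needs >= 2 clusters)."""
    clusters = np.unique(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue                         # singleton cluster: silhouette is 0
        a = D[i, same].sum() / (same.sum() - 1)
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

def select_k(X, h, k_values=(2, 3, 4, 5)):
    """Return the k with the highest average silhouette, plus all scores."""
    scores = {}
    for k in k_values:
        labels, D = density_peak_cluster(X, h, k)
        scores[k] = average_silhouette(D, labels)
    return max(scores, key=scores.get), scores
```

A call such as `select_k(X, rule_of_thumb_bandwidth(X))` on an `(n, d)` array returns the selected number of clusters and the silhouette score for each candidate. The full distance matrix makes this sketch quadratic in memory, so it is suited to illustration rather than large data sets.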

