Approximating Dunn's Cluster Validity Indices for Partitions of Big Data.

Publication information

IEEE Trans Cybern. 2019 May;49(5):1629-1641. doi: 10.1109/TCYB.2018.2806886. Epub 2018 Mar 5.

DOI:10.1109/TCYB.2018.2806886

Abstract

Dunn's internal cluster validity index is used to assess partition quality and subsequently identify a "best" crisp partition of n objects. Computing Dunn's index (DI) for partitions of n p-dimensional feature vectors has quadratic time complexity O(pn^2), so its computation is impractical for very large values of n. This note presents six methods for approximating DI. Four methods are based on Maximin sampling, which identifies a skeleton of the full partition that contains some boundary points in each cluster. Two additional methods estimate boundary points through unsupervised training of one-class support vector machines. Numerical examples compare approximations to DI based on all six methods. Four experiments on seven real and synthetic data sets support our assertion that computing approximations to DI with an incremental, neighborhood-based Maximin skeleton is both tractable and reliably accurate.
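
To make the idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the classical form of Dunn's index (smallest single-linkage distance between two clusters divided by the largest cluster diameter) with Euclidean distance, and it approximates DI on a skeleton obtained by running greedy Maximin (farthest-first) sampling separately inside each cluster. The function names and the `m_per_cluster` parameter are hypothetical, and the per-cluster sampling scheme is only one plausible reading of "Maximin skeleton"; the six variants in the paper may differ in detail.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def dunn_index(x, labels):
    """Exact Dunn's index: min between-cluster distance / max cluster diameter.

    Builds full pairwise distance blocks, so the cost is quadratic in n.
    """
    clusters = [x[labels == k] for k in np.unique(labels)]
    max_diam = max(pairwise_dist(c, c).max() for c in clusters)
    min_sep = min(
        pairwise_dist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_sep / max_diam


def maximin_sample(x, m, start=0):
    """Greedy Maximin sampling: return the indices of m rows of x.

    Each new point is the one farthest from the set already chosen,
    which tends to pick points near the boundary of the data.
    """
    chosen = [start]
    # Distance from every point to its nearest chosen point so far.
    nearest = np.linalg.norm(x - x[start], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(x - x[nxt], axis=1))
    return np.array(chosen)


def approx_dunn_index(x, labels, m_per_cluster):
    """Approximate DI on a skeleton of at most m_per_cluster points per cluster.

    Assumption: m_per_cluster >= 2, so that sampled diameters are nonzero.
    """
    keep = []
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        m = min(m_per_cluster, idx.size)
        keep.extend(idx[maximin_sample(x[idx], m)])
    keep = np.array(keep)
    return dunn_index(x[keep], labels[keep])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic Gaussian blobs, 200 points each (illustrative data only).
    x = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(10.0, 1.0, (200, 2))])
    labels = np.repeat([0, 1], 200)
    print("exact DI :", dunn_index(x, labels))
    print("approx DI:", approx_dunn_index(x, labels, 20))
```

Because the skeleton is a subset of the data, the sampled between-cluster distance can only grow and the sampled diameter can only shrink, so this kind of estimate never falls below the exact DI. How close it gets depends on how well the sample captures boundary points, which is the property Maximin sampling is meant to provide.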
