Champalimaud Research, Champalimaud Centre for the Unknown, Avenida Brasília, Doca de Pedrouços, Lisboa, Portugal.
Rowland Institute at Harvard, 100 Edwin H. Land Boulevard, Cambridge, MA, USA.
Bioinformatics. 2019 Jun 1;35(12):2125-2132. doi: 10.1093/bioinformatics/bty932.
How to partition a dataset into a set of distinct clusters is a ubiquitous and challenging problem. Data vary widely in features such as cluster shape, cluster number, density distribution, background noise, outliers and degree of overlap, which makes it difficult to find a single algorithm that can be broadly applied. One recent method, clusterdp, which is based on a search for density peaks, can be applied successfully to cluster many kinds of data, but it is not fully automatic and fails on some simple data distributions.
We propose an alternative approach, clusterdv, which estimates density dips between points, and allows robust determination of cluster number and distribution across a wide range of data, without any manual parameter adjustment. We show that this method is able to solve a range of synthetic and experimental datasets, where the underlying structure is known, and identifies consistent and meaningful clusters in new behavioral data.
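To make the idea of a "density dip between points" concrete, the following is a minimal Python sketch and not the clusterdv implementation (which is in Matlab). It assumes a Gaussian kernel density estimate with a hand-picked bandwidth, samples the density along the straight segment between two points, and reports the relative drop below the lower of the two endpoint densities. The function names `gaussian_kde_density` and `density_dip`, the bandwidth, and the straight-line path are all assumptions made for illustration; the published method may define and use dips differently.

```python
import numpy as np


def gaussian_kde_density(query, data, bandwidth):
    """Gaussian kernel density estimate of `data`, evaluated at each row of `query`."""
    query = np.atleast_2d(query)
    dim = data.shape[1]
    # Pairwise squared distances between query points and data points.
    sq_dist = ((query[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (dim / 2.0)
    kernel_sum = np.exp(-sq_dist / (2.0 * bandwidth ** 2)).sum(axis=1)
    return kernel_sum / (len(data) * norm)


def density_dip(a, b, data, bandwidth=0.5, n_steps=50):
    """Relative density drop along the segment a -> b (illustrative definition).

    Returns a value in [0, 1): 0 means the density never falls below the
    lower endpoint, values near 1 indicate a deep valley between a and b.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t = np.linspace(0.0, 1.0, n_steps)[:, None]
    path = (1.0 - t) * a + t * b  # points sampled on the segment, endpoints included
    dens = gaussian_kde_density(path, data, bandwidth)
    lower_end = min(dens[0], dens[-1])
    return 1.0 - dens.min() / lower_end


# Toy data (assumed for the example): two well-separated 2-D Gaussian blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2))
blob_b = rng.normal(loc=(5.0, 0.0), scale=0.5, size=(200, 2))
data = np.vstack([blob_a, blob_b])

dip_between = density_dip((0.0, 0.0), (5.0, 0.0), data)   # crosses the gap between blobs
dip_within = density_dip((-0.3, 0.0), (0.3, 0.0), data)   # stays inside one blob
print(f"dip between blobs: {dip_between:.3f}, dip within blob: {dip_within:.3f}")
```

In this toy setting, a pair of points separated by a low-density valley yields a large dip, while a pair inside the same blob yields a dip close to zero, which is the kind of contrast a dip-based criterion could use to decide whether two points belong to the same cluster.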
clusterdv is implemented in Matlab. Its source code, together with example datasets, is available at https://github.com/jcbmarques/clusterdv.
Supplementary data are available at Bioinformatics online.