Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, KY16 9ST, Scotland, UK.
BMC Bioinformatics. 2013 Jul 3;14:213. doi: 10.1186/1471-2105-14-213.
We present the algorithm PFClust (Parameter Free Clustering), which is able automatically to cluster data and identify a suitable number of clusters to group them into without requiring any parameters to be specified by the user. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intra-cluster similarity. A set of n objects can be clustered into any number of clusters from one to n, and there are many different hierarchical and partitional, agglomerative and divisive, clustering methodologies available that can be used to do this. Nonetheless, automatically determining the number of clusters present in a dataset constitutes a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters to group the objects into involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. Thus, we test PFClust on datasets for which an external gold standard of 'correct' cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply-expressed metric over the space of possible clusterings.
We validate PFClust firstly with reference to a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies - even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.
We show that PFClust is able to cluster the test datasets a little better, on average, than any of the other algorithms, and furthermore is able to do this without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, while our algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the CATH part-manually curated classification of protein domain structures.
我们提出了 PFClust(无参聚类)算法,它能够自动对数据进行聚类,并确定合适的聚类数量,无需用户指定任何参数。该算法将数据集划分为若干个具有共同属性的簇,例如最小期望值和簇内相似性的方差。一组 n 个对象可以被聚类为从 1 到 n 个任意数量的簇,并且有许多不同的层次和分区、凝聚和分裂聚类方法可用于实现此目的。然而,自动确定数据集中存在的聚类数量对于聚类算法来说是一个重大挑战。确定将对象分组到的假定最佳聚类数量涉及计算和评估具有不同聚类数量的一系列聚类。然而,在这种情况下,没有一致或唯一的最佳定义。因此,我们在存在外部黄金标准“正确”聚类定义的数据集上测试 PFClust,并注意到根据其他合理标准,这种聚类可能不是最优的。PFClust 在启发式意义上是不可描述的,因为它不能用任何单一的简单表示的度量来优化可能的聚类空间。
我们首先使用由 2D 向量组成的一些合成数据集来验证 PFClust,结果表明其聚类性能至少与其他六种领先方法相当——尽管其中五种方法提前被告知要使用多少个聚类。我们还展示了 PFClust 使用来自结构生物信息学数据库 CATH 的一组折叠来对蛋白质结构域的三维结构进行分类的能力。
我们表明,PFClust 能够平均更好地聚类测试数据集,比任何其他算法都要好,而且它不需要指定任何外部参数。在合成数据集上的结果表明,PFClust 生成了有意义的聚类,而我们的算法与从 CATH 中提取的数据集的正确分配也具有很好的一致性,CATH 是蛋白质结构域结构的手动分类部分。