PFClust: a novel parameter free clustering algorithm.

Affiliation

Biomedical Sciences Research Complex and EaStCHEM School of Chemistry, Purdie Building, University of St Andrews, North Haugh, St Andrews, KY16 9ST, Scotland, UK.

Publication information

BMC Bioinformatics. 2013 Jul 3;14:213. doi: 10.1186/1471-2105-14-213.

DOI:10.1186/1471-2105-14-213

PMID:23819480

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC3747858/

Abstract

BACKGROUND

We present PFClust (Parameter Free Clustering), an algorithm that automatically clusters data and identifies a suitable number of clusters without requiring the user to specify any parameters. The algorithm partitions a dataset into clusters that share some common attributes, such as their minimum expectation value and variance of intra-cluster similarity. A set of n objects can be grouped into any number of clusters from one to n, and many clustering methodologies are available for doing so: hierarchical and partitional, agglomerative and divisive. Nonetheless, automatically determining the number of clusters present in a dataset remains a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. We therefore test PFClust on datasets for which an external gold standard of 'correct' cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described as optimising any single, simply expressed metric over the space of possible clusterings.
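To make the idea of "expectation value and variance of intra-cluster similarity" concrete, the following minimal sketch computes those two statistics for each cluster of a candidate partition. It is an illustration only and not the PFClust procedure itself: the function names, the toy similarity measure (1 / (1 + Euclidean distance)) and the example points are all assumptions introduced here.

```python
from itertools import combinations
from statistics import mean, pvariance


def intra_cluster_stats(similarity, clusters):
    """Return one (mean, variance) pair of pairwise similarities per cluster.

    similarity: callable (a, b) -> float, where higher means more alike
                (an assumption of this sketch).
    clusters:   list of lists of objects.
    A singleton cluster has no pairs, so it yields (None, None).
    """
    stats = []
    for members in clusters:
        sims = [similarity(a, b) for a, b in combinations(members, 2)]
        if sims:
            stats.append((mean(sims), pvariance(sims)))
        else:
            stats.append((None, None))
    return stats


def toy_similarity(p, q):
    # Hypothetical similarity for 2D points: 1 / (1 + Euclidean distance).
    dist = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    return 1.0 / (1.0 + dist)


# Two well-separated toy groups of 2D points.
points_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
points_b = [(10.0, 10.0), (10.0, 11.0)]

# Statistics for the natural two-cluster partition ...
split = intra_cluster_stats(toy_similarity, [points_a, points_b])
# ... and for a single cluster that merges both groups.
merged = intra_cluster_stats(toy_similarity, [points_a + points_b])
```

On this toy data, merging the two groups lowers the mean intra-cluster similarity and raises its variance, which is the kind of signal that can be compared across candidate clusterings with different numbers of clusters.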

RESULTS

We first validate PFClust on a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies, even though five of those methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three-dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.

CONCLUSIONS

We show that PFClust clusters the test datasets a little better, on average, than any of the other algorithms, and does so without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, and the algorithm also shows excellent agreement with the correct assignments for a dataset extracted from CATH, a part-manually curated classification of protein domain structures.
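Agreement between a computed clustering and a gold standard can be quantified in several ways. The abstract does not say which measure was used, so the sketch below uses the Rand index purely as an assumed, commonly used example: the fraction of object pairs on which two labelings agree about "same cluster" versus "different cluster". The function name and the label lists are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated consistently by two labelings.

    A pair counts as an agreement when both labelings put the two objects
    in the same cluster, or both put them in different clusters.
    Label values themselves do not need to match between labelings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same objects")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # zero or one object: nothing to disagree about
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical gold standard and predicted cluster labels for six objects.
gold = [0, 0, 0, 1, 1, 1]
predicted = [1, 1, 1, 0, 0, 2]  # last object split off into its own cluster

score = rand_index(gold, predicted)  # 13 of 15 pairs agree
```

Because the index depends only on pair co-membership, it can compare clusterings with different numbers of clusters, which matters when one method chooses the number of clusters itself and the others are given it.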
