Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark.
Department of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.
Nat Protoc. 2018 Jun;13(6):1429-1444. doi: 10.1038/nprot.2018.038. Epub 2018 May 24.
Clustering is a popular technique for discovering groups of similar objects in large datasets. It is nowadays applied in all areas of life sciences, from biomedicine to physics. However, designing high-quality cluster analyses is a tedious and complicated task with manifold choices along the way. Because a cluster analysis is often the first step of a downstream analysis, the clustering must be reliable, reproducible, and of the highest quality. To address these challenges, we recently developed ClustEval, an integrated and extensible platform for the automated and standardized design and execution of complex cluster analyses. It allows researchers to design and carry out cluster analyses involving a large number of clustering methods applied to many large datasets. ClustEval helps to shed light on all major aspects of cluster analysis, from choosing the right similarity function to using validity indices and data preprocessing protocols. Only this high degree of automation allows the researcher to easily run a clustering task with many different tools, parameters, and settings to obtain the best possible outcome. In this paper, we guide the user step by step through three fundamentally important and widely applicable use cases: (i) identification of the best clustering method for a new, user-given protein sequence similarity dataset; (ii) evaluation of the performance of a new, user-given clustering method (densityCut) against the state of the art; and (iii) prediction of the best method for a new protein sequence similarity dataset. This protocol guides the user through the most important features of ClustEval and takes ∼4 h to complete.