Nugent Rebecca, Meila Marina
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA, USA.
Methods Mol Biol. 2010;620:369-404. doi: 10.1007/978-1-60761-580-4_12.
In molecular biology, we are often interested in determining the group structure in a data set, for example a population of cells or microarray gene expression data. Clustering methods identify groups of similar observations, but the results can depend on the chosen method's assumptions and starting parameter values. In this chapter, we give a broad overview of both attribute-based and similarity-based clustering, describing the methods and their performance. The parametric and nonparametric approaches presented differ in whether they require the number of clusters to be known in advance and in the shapes of the clusters they estimate. Additionally, we include a biclustering algorithm that incorporates variable selection into the clustering procedure. We finish with a discussion of some common methods for comparing two clustering solutions (possibly from different methods). The user is advised to devote time and attention to determining the appropriate clustering approach (and any corresponding parameter values) for the specific application prior to analysis.
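As an illustration of what comparing two clustering solutions can involve, the following minimal sketch computes the Rand index, the fraction of observation pairs on which two labelings agree (both place the pair together, or both place it apart). The abstract does not name the comparison measures the chapter covers, so the choice of the Rand index here is an assumption, and the function name `rand_index` and the example labelings are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of observation pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two
    observations in the same cluster, or both put them in different
    clusters. Cluster label names themselves do not matter.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same observations")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two observations to form a pair")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Hypothetical labelings of six observations (illustrative only).
solution_1 = [0, 0, 1, 1, 2, 2]
solution_2 = [1, 1, 0, 0, 2, 2]  # same partition, different label names
solution_3 = [0, 0, 0, 1, 1, 1]  # a coarser two-cluster partition

print(rand_index(solution_1, solution_2))
print(rand_index(solution_1, solution_3))
```

Because the index depends only on pair co-membership, relabeling the clusters leaves it unchanged, which is why it can be used to compare solutions produced by different methods or different starting values.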