Yeung K Y, Fraley C, Murua A, Raftery A E, Ruzzo W L
Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195, USA.
Bioinformatics. 2001 Oct;17(10):977-87. doi: 10.1093/bioinformatics/17.10.977.
Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
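To make the model-selection idea concrete, here is a minimal sketch, not the MCLUST implementation. It fits univariate Gaussian mixtures by EM for several candidate cluster counts and keeps the count with the best Bayesian Information Criterion (BIC). Several things here are assumptions made for illustration: the choice of BIC as the criterion (the abstract does not name one), the restriction to one dimension with unequal variances, the quantile-based initialization, the variance floor, and the function names `fit_mixture` and `select_num_clusters`.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(data, k, iters=100, var_floor=1e-3):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances, log_likelihood).
    """
    n = len(data)
    ordered = sorted(data)
    # Deterministic start (an assumption of this sketch): means at evenly
    # spaced quantiles, every component sharing the overall variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    variances = [max(grand_var, var_floor)] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)  # avoid degenerate components

    loglik = sum(
        math.log(max(sum(w * normal_pdf(x, m, v)
                         for w, m, v in zip(weights, means, variances)), 1e-300))
        for x in data
    )
    return weights, means, variances, loglik


def select_num_clusters(data, max_k):
    """Return (best_k, fits), where fits maps k to (bic, means).

    BIC is written as 2*loglik - p*log(n), so larger is better.
    """
    n = len(data)
    fits = {}
    for k in range(1, max_k + 1):
        _, means, _, loglik = fit_mixture(data, k)
        n_params = 3 * k - 1  # (k - 1) weights + k means + k variances
        fits[k] = (2.0 * loglik - n_params * math.log(n), means)
    best_k = max(fits, key=lambda k: fits[k][0])
    return best_k, fits


if __name__ == "__main__":
    # Synthetic data: two well-separated groups of 100 points each.
    rng = random.Random(0)
    sample = ([rng.gauss(0.0, 1.0) for _ in range(100)]
              + [rng.gauss(8.0, 1.0) for _ in range(100)])
    best_k, fits = select_num_clusters(sample, 3)
    print("selected number of clusters:", best_k)
```

In this sketch the number of clusters is never supplied by the user. Each candidate count is simply a different model, and the criterion trades goodness of fit against the number of free parameters. Extending the idea to multivariate data adds a second axis of choice, the covariance structure, which is the "model" part of the selection problem described above.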
We benchmarked model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. On our synthetic data sets, the model-based approach showed superior performance, consistently selecting the correct model and the correct number of clusters. On real expression data, it produced clusters of quality comparable to a leading heuristic clustering algorithm, with the key advantage of also suggesting the number of clusters and an appropriate model. We also examined the validity of the Gaussian mixture assumption by assessing how well these real gene expression data sets fit multivariate Gaussian distributions both before and after commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits.
MCLUST is available at http://www.stat.washington.edu/fraley/mclust. The software for the diagonal model is under development.