Pihur Vasyl, Datta Susmita, Datta Somnath
Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40202, USA.
Bioinformatics. 2007 Jul 1;23(13):1607-15. doi: 10.1093/bioinformatics/btm158. Epub 2007 May 5.
Biologists often employ clustering techniques in the exploratory phase of microarray data analysis to discover relevant biological groupings. Given the many clustering algorithms available in the machine-learning literature, a user might want to select the one that performs best for their data set or application. Various validation measures have been proposed over the years to judge the quality of the clusters produced by a given clustering algorithm, including their biological relevance. Unfortunately, a given clustering algorithm can perform poorly under one validation measure while outperforming many other algorithms under another. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed.
Using a Monte Carlo cross-entropy algorithm, we combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than a simple visual inspection. We illustrate our procedure using one simulated and three real gene expression data sets from various platforms, where we rank a total of 11 clustering algorithms using a combined examination of 10 different validation measures. The aggregate rankings were found for a given number of clusters k and also for an entire range of k.
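The idea of cross-entropy rank aggregation can be illustrated with a minimal Python sketch. It is not the authors' R code. Several details are assumptions made for illustration only: the distance criterion is taken to be a weighted Spearman footrule distance with one weight per validation measure, and the function names, parameter defaults (`n_samples`, `elite_frac`, `smooth`, `n_iter`), and algorithm labels are hypothetical. The sketch keeps a probability table over (position, item) pairs, samples candidate orderings from it, and shifts the table toward the best-scoring samples.

```python
import random


def footrule(candidate, ranking):
    """Spearman footrule distance between two orderings of the same items."""
    pos_c = {item: i for i, item in enumerate(candidate)}
    pos_r = {item: i for i, item in enumerate(ranking)}
    return sum(abs(pos_c[item] - pos_r[item]) for item in pos_c)


def objective(candidate, rankings, weights):
    """Weighted sum of distances from the candidate to every input ranking."""
    return sum(w * footrule(candidate, r) for r, w in zip(rankings, weights))


def sample_ordering(items, prob, rng):
    """Draw an ordering position by position, without replacement."""
    remaining = list(items)
    ordering = []
    for pos in range(len(items)):
        w = [prob[pos][it] for it in remaining]
        choice = rng.choices(remaining, weights=w, k=1)[0]
        ordering.append(choice)
        remaining.remove(choice)
    return ordering


def aggregate_ranks(rankings, weights, n_samples=200, elite_frac=0.1,
                    smooth=0.7, n_iter=50, seed=0):
    """Cross-entropy Monte Carlo search for a low-distance aggregate ordering."""
    rng = random.Random(seed)
    items = sorted(rankings[0])
    n = len(items)
    # prob[pos][item]: sampling weight for placing `item` at position `pos`.
    prob = [{it: 1.0 / n for it in items} for _ in range(n)]
    n_elite = max(1, int(elite_frac * n_samples))
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        samples = [sample_ordering(items, prob, rng) for _ in range(n_samples)]
        samples.sort(key=lambda s: objective(s, rankings, weights))
        elite = samples[:n_elite]
        top_score = objective(elite[0], rankings, weights)
        if top_score < best_score:
            best, best_score = elite[0], top_score
        # Move the table toward the elite frequencies; the small floor keeps
        # every weight positive so sampling never fails.
        for pos in range(n):
            for it in items:
                freq = sum(1 for s in elite if s[pos] == it) / n_elite
                updated = smooth * freq + (1 - smooth) * prob[pos][it]
                prob[pos][it] = max(updated, 1e-12)
    return best, best_score


if __name__ == "__main__":
    # Three hypothetical validation measures ranking four placeholder algorithms.
    rankings = [["A", "B", "C", "D"], ["A", "B", "C", "D"], ["B", "A", "C", "D"]]
    weights = [1.0, 1.0, 1.0]
    print(aggregate_ranks(rankings, weights))
```

Each input list here plays the role of one validation measure's ranking of the clustering algorithms. The search is stochastic, so on larger problems it returns a good ordering rather than a guaranteed optimum, and a different distance (for example Kendall's tau) could be substituted in `objective` without changing the rest of the loop.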
R code for all validation measures and rank aggregation is available from the authors upon request.
Supplementary information is available at http://www.somnathdatta.org/Supp/RankCluster/supp.htm.