de Souto Marcilio C P, Costa Ivan G, de Araujo Daniel S A, Ludermir Teresa B, Schliep Alexander
Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
BMC Bioinformatics. 2008 Nov 27;9:497. doi: 10.1186/1471-2105-9-497.
BACKGROUND: The use of clustering methods for the discovery of cancer subtypes has drawn a great deal of attention in the scientific community. While bioinformaticians have proposed new clustering methods that take advantage of characteristics of the gene expression data, the medical community has a preference for using "classic" clustering methods. No study has so far performed a large-scale evaluation of different clustering methods in this context.
RESULTS/CONCLUSION: We present the first large-scale analysis of seven different clustering methods and four proximity measures applied to 35 cancer gene expression data sets. Our results reveal that the finite mixture of Gaussians, followed closely by k-means, exhibited the best performance in terms of recovering the true structure of the data sets. These methods also exhibited, on average, the smallest difference between the actual number of classes in the data sets and the best number of clusters as indicated by our validation criteria. Furthermore, hierarchical methods, which have been widely used by the medical community, exhibited a poorer recovery performance than that of the other methods evaluated. Moreover, as a stable basis for the assessment and comparison of different clustering methods for cancer gene expression data, this study provides a common group of data sets (benchmark data sets) to be shared among researchers and used for comparisons with new methods. The data sets analyzed in this study are available at http://algorithmics.molgen.mpg.de/Supplements/CompCancer/.
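The two evaluation criteria mentioned above (recovery of the true structure, and the gap between the actual number of classes and the number of clusters found) can be illustrated with a minimal sketch. The abstract does not name the agreement index used, so the adjusted Rand index here is an assumption chosen for illustration, and the function names `adjusted_rand_index` and `cluster_count_gap` are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Chance-corrected agreement between a known labeling and a clustering.

    Assumption: the adjusted Rand index stands in for whatever recovery
    measure a given study uses; it is not confirmed by the abstract.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")

    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Contingency counts: samples per (class, cluster), per class, per cluster.
    cells = Counter(zip(true_labels, cluster_labels))
    classes = Counter(true_labels)
    clusters = Counter(cluster_labels)

    # Number of sample pairs placed together under each view.
    together_both = sum(comb(c, 2) for c in cells.values())
    together_true = sum(comb(c, 2) for c in classes.values())
    together_pred = sum(comb(c, 2) for c in clusters.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (together_both - expected) / (maximum - expected)


def cluster_count_gap(true_labels, cluster_labels):
    """Absolute difference between the number of classes and of clusters."""
    return abs(len(set(true_labels)) - len(set(cluster_labels)))


if __name__ == "__main__":
    # Toy labels for six samples, invented purely for illustration.
    known = ["A", "A", "A", "B", "B", "B"]
    found = [0, 0, 1, 1, 2, 2]
    print(adjusted_rand_index(known, found))
    print(cluster_count_gap(known, found))
```

A score of 1 means the clustering reproduces the known classes exactly, while values near 0 indicate agreement no better than chance. In practice the cluster labels would come from a clustering method such as k-means or a Gaussian mixture run on the expression matrix.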