Yu Zhiwen, Wong Hau-San, Wang Hongqiang
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
Bioinformatics. 2007 Nov 1;23(21):2888-96. doi: 10.1093/bioinformatics/btm463. Epub 2007 Sep 14.
Consensus clustering, also known as cluster ensemble, is an important technique for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches can integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. With consensus clustering, one can discover the underlying classes of the samples in gene expression data.
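To illustrate what integrating multiple partitions can look like, the following is a minimal sketch, not the GCC algorithm of this article. It assumes a simple co-association approach: count how often each pair of samples shares a cluster, link pairs that agree in more than a chosen fraction of partitions, and read the clusters off as connected components of that graph. The function names and the `threshold` parameter are hypothetical.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of samples shares a cluster."""
    n = len(partitions[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_labels(partitions, threshold=0.5):
    """Link samples whose co-association exceeds `threshold` (an assumed
    cut-off) and return the connected components as consensus clusters."""
    agree = co_association(partitions)
    n = len(agree)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to a component
        labels[start] = next_label
        stack = [start]
        while stack:  # depth-first walk over the agreement graph
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and agree[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping, different label names
    [0, 0, 1, 1, 1, 1],  # one sample assigned differently
]
print(consensus_labels(partitions))  # [0, 0, 0, 1, 1, 1]
```

Because the co-association counts depend only on which samples are grouped together, the arbitrary label names used by each base clustering do not matter, and the single disagreeing assignment in the third partition is outvoted.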
In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time GCC has been applied to class discovery for microarray data. Given a prespecified maximum number of classes (denoted as K(max) in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning.
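The Modified Rand Index is not defined in this abstract, so the sketch below uses the standard Rand index as a stand-in and should not be read as the authors' index. It assumes a simple stability criterion: for each candidate number of classes up to K(max), compare repeated clustering runs pairwise and pick the candidate whose runs agree most. The names `rand_index`, `pick_num_classes` and `runs_by_k` are hypothetical.

```python
from itertools import combinations


def rand_index(a, b):
    """Standard Rand index: fraction of sample pairs on which two
    partitions agree (both together or both apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def pick_num_classes(runs_by_k):
    """Return the k whose repeated runs agree most, by mean pairwise
    Rand index. `runs_by_k` maps each candidate k (2..K_max) to a list
    of at least two label vectors. This stability criterion is an
    illustrative assumption, not the index proposed in the article."""
    def stability(runs):
        scores = [rand_index(p, q) for p, q in combinations(runs, 2)]
        return sum(scores) / len(scores)

    return max(runs_by_k, key=lambda k: stability(runs_by_k[k]))


# Toy data: the runs for k=2 always recover the same grouping,
# while the runs for k=3 split the samples inconsistently.
runs_by_k = {
    2: [[0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]],
    3: [[0, 0, 1, 1, 2, 2], [0, 1, 1, 2, 2, 0], [0, 0, 0, 1, 2, 2]],
}
print(pick_num_classes(runs_by_k))  # 2
```

The Rand index compares partitions pair by pair, so it is unaffected by how each run happens to name its clusters.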
Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu.