Samuel Lattimore B, van Dongen Stijn, Crabbe M James C
School of Animal and Microbial Sciences, University of Reading, Whiteknights, Reading RG6 6AJ, UK.
Comput Biol Chem. 2005 Oct;29(5):354-9. doi: 10.1016/j.compbiolchem.2005.07.002. Epub 2005 Sep 19.
Few algorithms can accurately and reliably identify the actual number of clusters present in a dataset of gene expression profiles when no additional information on cluster structure is available. GeneMCL transforms microarray analysis data into a graph of nodes connected by edges, where the nodes represent genes and the edges represent the similarity in expression of those genes, as given by a proximity measurement. This measurement is taken to be the Pearson correlation coefficient combined with a local non-linear rescaling step. The resulting graph is input to the Markov Cluster (MCL) algorithm, an elegant, deterministic, non-specific and scalable method that models stochastic flow through the graph. The algorithm is inherently affected by any cluster structure present, and rapidly decomposes a graph into cohesive clusters. The potential of the GeneMCL algorithm is demonstrated with a 5,730-gene subset (IGS) of the Van't Veer breast cancer database, for which the clusterings are shown to reflect underlying biological mechanisms.
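The pipeline described in the abstract (a correlation graph followed by MCL) can be illustrated with a minimal NumPy sketch. It is not the GeneMCL implementation: the abstract does not specify the local non-linear rescaling step, so a plain correlation cutoff (`threshold`, a hypothetical parameter) stands in for it, and the function names, the expansion and inflation values of 2, and the toy expression matrix are all assumptions made for illustration.

```python
import numpy as np


def correlation_graph(expr, threshold=0.7):
    """Build a weighted gene graph from a genes x samples matrix.

    Edge weights are Pearson correlations. The hard cutoff is an assumed
    stand-in for the rescaling step, which the abstract does not specify.
    Rows with constant expression would give NaN and are not handled here.
    """
    corr = np.corrcoef(expr)                 # Pearson correlation between rows
    adj = np.where(corr >= threshold, corr, 0.0)
    np.fill_diagonal(adj, 1.0)               # self-loops, customary for MCL
    return adj


def normalize_columns(m):
    """Rescale every column to sum to 1 (column-stochastic matrix)."""
    return m / m.sum(axis=0, keepdims=True)


def mcl(adj, expansion=2, inflation=2.0, max_iter=100, tol=1e-8):
    """Alternate expansion and inflation until the flow matrix stops changing."""
    m = normalize_columns(adj.astype(float))
    for _ in range(max_iter):
        prev = m
        m = np.linalg.matrix_power(m, expansion)       # expansion: flow spreads
        m = normalize_columns(np.power(m, inflation))  # inflation: strong flow wins
        if np.abs(m - prev).max() < tol:
            break
    # Each non-empty row lists the nodes that send flow to that row's node;
    # identical rows are merged so every cluster is reported once.
    clusters = set()
    for row in m:
        members = tuple(np.flatnonzero(row > 1e-6).tolist())
        if members:
            clusters.add(members)
    return sorted(clusters)


# Toy data (invented): six genes over eight samples, following two patterns.
a = np.arange(1.0, 9.0)                      # steadily increasing profile
b = np.array([1.0, -1.0] * 4)                # alternating profile
expr = np.array([a, 2 * a + 1, 0.5 * a - 3, b, 3 * b, b + 2])

clusters = mcl(correlation_graph(expr))
print(clusters)
```

The number of clusters is not passed in anywhere; it emerges from the flow simulation, which is the property the abstract emphasizes. In this sketch the cutoff and the inflation value still influence how coarse or fine the clustering is.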