Gibbons Francis D, Roth Frederick P
Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, Massachusetts 02115, USA.
Genome Res. 2002 Oct;12(10):1574-81. doi: 10.1101/gr.397002.
We compare several commonly used expression-based gene clustering algorithms using a figure of merit based on the mutual information between cluster membership and known gene attributes. By studying various publicly available expression data sets we conclude that enrichment of clusters for biological function is, in general, highest at rather low cluster numbers. As a measure of dissimilarity between the expression patterns of two genes, no method outperforms Euclidean distance for ratio-based measurements, or Pearson distance for non-ratio-based measurements at the optimal choice of cluster number. We show the self-organized-map approach to be best for both measurement types at higher numbers of clusters. Clusters of genes derived from single- and average-linkage hierarchical clustering tend to produce worse-than-random results.
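The figure of merit described in the abstract rests on the mutual information between cluster membership and known gene attributes. As a minimal sketch of that underlying quantity only, the following computes the mutual information (in bits) between a cluster assignment and a single categorical attribute. The function name `mutual_information` and the toy labels are assumptions made for illustration; the abstract does not say how the authors combine attributes or normalize the score, so this should not be read as their exact procedure.

```python
import math
from collections import Counter


def mutual_information(clusters, attribute):
    """Mutual information, in bits, between two equal-length label sequences.

    clusters  -- cluster label for each gene (hypothetical input)
    attribute -- known attribute value for each gene (hypothetical input)
    """
    if len(clusters) != len(attribute):
        raise ValueError("both sequences must label the same genes")
    n = len(clusters)
    if n == 0:
        return 0.0

    joint = Counter(zip(clusters, attribute))   # counts of (cluster, attribute) pairs
    cluster_counts = Counter(clusters)          # marginal counts per cluster
    attribute_counts = Counter(attribute)       # marginal counts per attribute value

    mi = 0.0
    for (c, a), n_ca in joint.items():
        # p(c, a) * log2( p(c, a) / (p(c) * p(a)) ), written in terms of counts
        mi += (n_ca / n) * math.log2(
            n_ca * n / (cluster_counts[c] * attribute_counts[a])
        )
    return mi


# Toy example: four genes, two clusters, one binary attribute.
perfect = mutual_information([0, 0, 1, 1], [True, True, False, False])
unrelated = mutual_information([0, 0, 1, 1], [True, False, True, False])
```

In this toy case the first clustering separates the attribute perfectly and the second carries no information about it, so a higher value indicates clusters that are more enriched for the attribute.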