Duan Fenghai, Zhang Heping
Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520-8034, USA.
Bioinformatics. 2004 Jul 22;20(11):1766-71. doi: 10.1093/bioinformatics/bth169. Epub 2004 May 27.
Because synchrony is lost in cell-cycle data sets, standard clustering methods (e.g. k-means), which group open reading frames (ORFs) by similar expression levels, perform poorly unless the temporal pattern of the ORFs' expression levels is taken into account.
We propose to improve the performance of the k-means method by assigning a decreasing weight at its variable level, and we evaluate the resulting 'weighted k-means' on a yeast cell-cycle data set. Protein complexes from a public website serve as biological benchmarks. To compare the k-means clusters with the structures of the protein complexes, we measure the agreement between the two groupings with the adjusted Rand index.
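As an illustration of the agreement measure named above, the following is a minimal sketch of the adjusted Rand index in its standard pair-counting (Hubert-Arabie) form. The function name and the toy label lists are assumptions made for this example; they are not the ORF clusters or protein complexes used in the study, and the sketch assumes each item belongs to exactly one group in each partition.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same items.

    labels_a[i] and labels_b[i] are the group labels of item i in the
    two partitions (e.g. a k-means cluster id and a complex id).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree

    # Contingency-table cells and their row/column margins.
    cell_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of item pairs placed together in both, in A, and in B.
    together_both = sum(comb(c, 2) for c in cell_counts.values())
    together_a = sum(comb(c, 2) for c in a_counts.values())
    together_b = sum(comb(c, 2) for c in b_counts.values())

    expected = together_a * together_b / total_pairs
    maximum = 0.5 * (together_a + together_b)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one group
    return (together_both - expected) / (maximum - expected)


# Toy example: the same grouping under different label names.
print(adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"]))
```

A value of 1 indicates identical partitions, and values near 0 indicate agreement no better than chance.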
Our results show that the time-decreasing weight function exp[-(1/2)(t^2/C^2)], which we assign at the variable level of k-means, generally increases the agreement between protein complexes and k-means clusters when C is near the length of two cell cycles.
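To make the weight function concrete, here is a small sketch that evaluates exp[-(1/2)(t^2/C^2)] at a set of sampling times and uses the result in a weighted squared Euclidean distance. The abstract does not spell out how the weights enter the k-means objective, so multiplying each time point's squared difference by its weight is an assumption, as are the function names, the time units, and all numeric values below.

```python
import math


def time_decay_weights(times, c):
    """Weight exp(-(1/2) * t^2 / C^2) for each sampling time t."""
    if c <= 0:
        raise ValueError("C must be positive")
    return [math.exp(-0.5 * (t / c) ** 2) for t in times]


def weighted_sq_distance(x, y, weights):
    """Squared Euclidean distance with one weight per time point.

    Assumption: the weights scale the per-variable squared differences,
    so later (less synchronized) time points contribute less.
    """
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("profiles and weights must have equal length")
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))


# Hypothetical sampling times and C, in arbitrary time units.
times = [0, 30, 60, 90, 120]
weights = time_decay_weights(times, c=120)

# Two hypothetical expression profiles measured at those times.
profile_1 = [0.1, 0.8, -0.3, 0.5, -0.2]
profile_2 = [0.0, 0.7, -0.1, 0.1, 0.6]
print(weights)
print(weighted_sq_distance(profile_1, profile_2, weights))
```

Under this assumption, the same clustering can be obtained by multiplying each time-point column by the square root of its weight and running an unweighted k-means on the rescaled data.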