Department of Statistics, University of California, Los Angeles, CA 90095, USA.
Bioinformatics. 2010 Feb 1;26(3):341-7. doi: 10.1093/bioinformatics/btp671. Epub 2009 Dec 9.
Various clustering methods have been applied to microarray gene expression data to identify genes with similar expression profiles. As biological annotation data have accumulated, more and more genes have been organized into functional categories. Functionally related genes may be regulated by common cellular signals and are thus likely to be co-expressed. Consequently, there is great interest in using rapidly growing functional annotation resources such as Gene Ontology (GO) to improve the performance of clustering methods. Conversely, some genes have distinct expression profiles and are not co-expressed with other genes. Identifying these scattered genes could also enhance the performance of clustering methods.
We developed a new clustering algorithm, Dynamically Weighted Clustering with Noise set (DWCN), which makes use of gene annotation information and allows for a set of scattered genes, the noise set, to be left out of the main clusters. We tested the DWCN method and contrasted its results with those obtained using several common clustering techniques on a simulated dataset as well as on two public datasets: the Stanford yeast cell-cycle gene expression data, and a gene expression dataset for a group of genetically different yeast segregants.
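To make the two ideas above concrete (annotation-aware assignment and a noise set), the following is a minimal illustrative sketch and not the published DWCN algorithm, whose details are not given in this abstract. It assumes a k-means-style loop, a fixed annotation weight (DWCN weights dynamically), and a hard distance threshold for the noise set. All names (`cluster_with_noise`, `shared_fraction`, `noise_threshold`, `annotation_weight`) and the toy data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def shared_fraction(terms, members, annotations):
    """Fraction of cluster members sharing at least one annotation term with `terms`."""
    if not terms or not members:
        return 0.0
    hits = sum(1 for m in members if terms & annotations[m])
    return hits / len(members)


def cluster_with_noise(expression, annotations, seeds, noise_threshold,
                       annotation_weight=0.5, max_iter=20):
    """Assumed scheme: k-means-like clustering with an annotation penalty.

    Genes whose expression distance to every centroid exceeds
    `noise_threshold` are labelled -1 (the noise set).
    """
    centroids = [list(expression[s]) for s in seeds]
    members = [[s] for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for g, profile in enumerate(expression):
            dists = [euclidean(profile, c) for c in centroids]
            if min(dists) > noise_threshold:
                new_labels.append(-1)  # scattered gene: left out of all clusters
                continue
            # Expression distance plus a penalty for functional disagreement.
            scores = [
                d + annotation_weight
                * (1.0 - shared_fraction(annotations[g], members[c], annotations))
                for c, d in enumerate(dists)
            ]
            new_labels.append(scores.index(min(scores)))
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        members = [[g for g, lab in enumerate(labels) if lab == c]
                   for c in range(len(centroids))]
        for c, genes in enumerate(members):
            if genes:  # an empty cluster keeps its previous centroid
                dim = len(centroids[c])
                centroids[c] = [sum(expression[g][j] for g in genes) / len(genes)
                                for j in range(dim)]
    return labels


# Toy data: two tight groups and one gene far from both.
toy_expression = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [20.0, -20.0]]
toy_annotations = [{"GO:A"}, {"GO:A"}, {"GO:B"}, {"GO:B"}, set()]
toy_labels = cluster_with_noise(toy_expression, toy_annotations,
                                seeds=[0, 2], noise_threshold=2.0)
print(toy_labels)
```

In this sketch the noise decision uses expression distance alone, so annotation can shift a gene between clusters but cannot pull a scattered gene into one. That is one possible design choice among several.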
Our method produces clusters with more consistent functional annotations and more coherent expression patterns than existing clustering techniques.
Supplementary data are available at Bioinformatics online.