Dotan-Cohen Dikla, Kasif Simon, Melkman Avraham A
Department of Computer Science, Ben-Gurion University, Beer Sheva, Israel 84105.
Bioinformatics. 2009 Jul 15;25(14):1789-95. doi: 10.1093/bioinformatics/btp327. Epub 2009 Jun 3.
There is growing interest in improving the cluster analysis of expression data by incorporating prior knowledge, such as the Gene Ontology (GO) annotations of genes, so that the clusters subjected to subsequent scrutiny are more biologically relevant. The structure of the GO is a further source of background knowledge, which can be exploited through semantic similarity.
We propose a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure for deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach handles the multiple annotations, drawn from different levels of the GO hierarchy, that most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. Both in cross-validation tests and when evaluated against an external index such as protein-protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein-protein interaction data.
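The general idea of using semantic similarity to decide where to cut an expression-based dendrogram can be illustrated with a small sketch. This is not the authors' algorithm, whose details the abstract does not give. It is a simplified stand-in built on stated assumptions: a toy ontology with hypothetical term names, Jaccard overlap of ancestor sets as the semantic similarity, a best-match rule for genes with several annotations, and a top-down cut that splits a subtree whenever the mean pairwise similarity of its annotated genes falls below a hypothetical threshold. Unannotated genes are ignored when scoring and simply stay with their expression neighbours.

```python
from itertools import combinations

# Toy ontology with hypothetical term names: term -> list of parent terms.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "glucose_metabolism": ["metabolism"],
    "glycolysis": ["glucose_metabolism"],
    "immune_response": ["root"],
    "t_cell_activation": ["immune_response"],
}

def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def term_similarity(a, b):
    """Jaccard overlap of ancestor sets (an assumed, simple structural measure)."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    return len(anc_a & anc_b) / len(anc_a | anc_b)

def gene_similarity(terms_a, terms_b):
    """Score two genes by their best-matching pair of terms."""
    return max(term_similarity(a, b) for a in terms_a for b in terms_b)

def leaves(node):
    """Flatten a dendrogram given as nested 2-tuples with gene names as leaves."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])

def coherence(genes, annotations):
    """Mean pairwise similarity of annotated genes, or None if fewer than two."""
    annotated = [g for g in genes if annotations.get(g)]
    pairs = list(combinations(annotated, 2))
    if not pairs:
        return None
    total = sum(gene_similarity(annotations[a], annotations[b]) for a, b in pairs)
    return total / len(pairs)

def cut(node, annotations, threshold=0.5):
    """Top-down cut: keep a subtree as one cluster if it is coherent enough."""
    genes = leaves(node)
    score = coherence(genes, annotations)
    if isinstance(node, str) or score is None or score >= threshold:
        return [genes]
    return cut(node[0], annotations, threshold) + cut(node[1], annotations, threshold)

# Hypothetical dendrogram and annotations; g3 and g6 are unannotated.
dendrogram = ((("g1", "g2"), "g3"), (("g4", "g5"), "g6"))
annotations = {
    "g1": {"glycolysis"},
    "g2": {"glucose_metabolism", "metabolism"},
    "g4": {"immune_response"},
    "g5": {"t_cell_activation"},
}
clusters = cut(dendrogram, annotations)
print(clusters)
```

With these toy inputs the root is split because its annotated genes mix metabolic and immune terms, while each child subtree is kept whole. In practice the dendrogram would come from a hierarchical clustering of the expression profiles, and the similarity measure and cut criterion would follow the published method rather than this sketch.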