Bar-Joseph Ziv, Demaine Erik D, Gifford David K, Srebro Nathan, Hamel Angèle M, Jaakkola Tommi S
Laboratory for Computer Science, MIT, 545 Technology Square, Cambridge, MA 02139, USA.
Bioinformatics. 2003 Jun 12;19(9):1070-8. doi: 10.1093/bioinformatics/btg030.
A major challenge in gene expression analysis is effective data organization and visualization. One of the most popular tools for this task is hierarchical clustering. Hierarchical clustering allows a user to view relationships in scales ranging from single genes to large sets of genes, while at the same time providing a global view of the expression data. However, hierarchical clustering is very sensitive to noise, it usually lacks of a method to actually identify distinct clusters, and produces a large number of possible leaf orderings of the hierarchical clustering tree. In this paper we propose a new hierarchical clustering algorithm which reduces susceptibility to noise, permits up to k siblings to be directly related, and provides a single optimal order for the resulting tree.
We present an algorithm that efficiently constructs a k-ary tree, where each node can have up to k children, and then optimally orders the leaves of that tree. By combining k clusters at each step our algorithm becomes more robust against noise and missing values. By optimally ordering the leaves of the resulting tree we maintain the pairwise relationships that appear in the original method, without sacrificing the robustness. Our k-ary construction algorithm runs in O(n(3)) regardless of k and our ordering algorithm runs in O(4(k)n(3)). We present several examples that show that our k-ary clustering algorithm achieves results that are superior to the binary tree results in both global presentation and cluster identification.
We have implemented the above algorithms in C++ on the Linux operating system.
基因表达分析中的一个主要挑战是有效的数据组织和可视化。用于此任务的最流行工具之一是层次聚类。层次聚类允许用户查看从单个基因到大量基因集合范围内的关系,同时提供表达数据的全局视图。然而,层次聚类对噪声非常敏感,通常缺乏实际识别不同簇的方法,并且会产生层次聚类树的大量可能叶排序。在本文中,我们提出了一种新的层次聚类算法,该算法降低了对噪声的敏感性,允许多达k个兄弟节点直接相关,并为所得树提供单个最优排序。
我们提出了一种算法,该算法能高效地构建一棵k叉树,其中每个节点最多可有k个子节点,然后对该树的叶进行最优排序。通过在每个步骤合并k个簇,我们的算法对噪声和缺失值变得更具鲁棒性。通过对所得树的叶进行最优排序,我们保持了原始方法中出现的成对关系,而不牺牲鲁棒性。我们的k叉构建算法无论k为何值都在O(n(3))时间内运行,我们的排序算法在O(4(k)n(3))时间内运行。我们给出了几个示例,表明我们的k叉聚类算法在全局呈现和簇识别方面都取得了优于二叉树结果的效果。
我们已在Linux操作系统上用C++实现了上述算法。