Liu Xiaowen, Wang Lusheng
Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong.
Bioinformatics. 2007 Jan 1;23(1):50-6. doi: 10.1093/bioinformatics/btl560. Epub 2006 Nov 7.
Bi-clustering is an important approach in microarray data analysis. The underlying bases for using bi-clustering in the analysis of gene expression data are (1) similar genes may exhibit similar behaviors only under a subset of conditions, not all conditions, (2) genes may participate in more than one function, resulting in one regulation pattern in one context and a different pattern in another. Using bi-clustering algorithms, one can obtain sets of genes that are co-regulated under subsets of conditions.
We develop a polynomial time algorithm to find an optimal bi-cluster with the maximum similarity score. To our knowledge, this is the first formulation for bi-cluster problems that admits a polynomial time algorithm for optimal solutions. The algorithm works for a special case, where the bi-clusters are approximately squares. We then extend the algorithm to handle various kinds of other cases. Experiments on simulation data and real data show that the new algorithms outperform most of the existing methods in many cases. Our new algorithms have the following advantages: (1) no discretization procedure is required, (2) performs well for overlapping bi-clusters and (3) works well for additive bi-clusters.
The software is available at http://www.cs.cityu.edu.hk/~liuxw/msbe/help.html.
双聚类是微阵列数据分析中的一种重要方法。在基因表达数据分析中使用双聚类的潜在依据是:(1)相似基因可能仅在部分条件下,而非所有条件下表现出相似行为;(2)基因可能参与多种功能,导致在一种情况下呈现一种调控模式,而在另一种情况下呈现不同模式。使用双聚类算法,可以获得在部分条件下共同调控的基因集。
我们开发了一种多项式时间算法来找到具有最大相似性得分的最优双聚类。据我们所知,这是双聚类问题的首个公式化表述,它允许使用多项式时间算法来求解最优解。该算法适用于双聚类近似为正方形的特殊情况。然后我们扩展该算法以处理各种其他情况。对模拟数据和真实数据的实验表明,在许多情况下新算法优于大多数现有方法。我们的新算法具有以下优点:(1)无需离散化过程;(2)对重叠双聚类表现良好;(3)对加性双聚类效果良好。