Chen Jiun-Rung, Chang Ye-In
Dept. of Computer Science and Engineering, National Sun Yat-Sen University, No. 70, Lienhai Rd., Kaohsiung 80424, Taiwan, ROC.
Biosystems. 2009 Jul;97(1):44-59. doi: 10.1016/j.biosystems.2009.04.003. Epub 2009 Apr 23.
Biclustering, which performs simultaneous clustering of rows (e.g., genes) and columns (e.g., conditions), has proved of great value for finding interesting patterns from microarray data. To find biclusters, a model called pCluster was proposed. A pCluster consists of a set of genes and a set of conditions, where the expression levels of these genes have a similar variation under these conditions. Based on this model, most of the previous methods need to compute MDSs (maximum dimension sets) for every two genes in the microarray data. Since the number of genes is far larger than the number of conditions, this step is inefficient. Another method called MicroCluster was proposed. This method does not compute MDSs for every two genes, and transforms the problem into a graph problem. However, it needs to solve the Maximal Clique problem, which is NP-Complete. To avoid the above disadvantages, in this paper, we propose a new method, CE-Tree (Condition-Enumeration Tree), for finding pClusters. Instead of generating MDSs for every two genes, we generate only MDSs for every two conditions. Then, based only on these MDSs, we expand the CE-Tree in a special local breadth-first within global depth-first manner to efficiently find all pClusters. We also utilize the idea of the traditional hash join approach to efficiently support the CE-Tree. From the simulation results, we show that the CE-Tree method could find pClusters more efficiently than those previous methods.
双聚类,即对行(如基因)和列(如条件)同时进行聚类,已被证明对于从微阵列数据中发现有趣的模式具有重要价值。为了找到双聚类,人们提出了一种名为pCluster的模型。一个pCluster由一组基因和一组条件组成,其中这些基因在这些条件下的表达水平具有相似的变化。基于此模型,之前的大多数方法需要为微阵列数据中的每两个基因计算最大维数集(MDSs)。由于基因的数量远大于条件的数量,这一步效率低下。另一种名为MicroCluster的方法被提出来了。该方法不为每两个基因计算MDSs,而是将问题转化为一个图问题。然而,它需要解决最大团问题,这是一个NP完全问题。为了避免上述缺点,在本文中,我们提出了一种新的方法——条件枚举树(CE-Tree)来寻找pClusters。我们不是为每两个基因生成MDSs,而是只为每两个条件生成MDSs。然后,仅基于这些MDSs,我们以一种特殊的全局深度优先内的局部广度优先方式扩展CE-Tree,以有效地找到所有的pClusters。我们还利用传统哈希连接方法的思想来有效地支持CE-Tree。从模拟结果来看,我们表明CE-Tree方法比之前的那些方法能更有效地找到pClusters。