Universidade da Beira Interior, Covilhã, KDBIO Group, INESC-ID, Lisbon, Portugal.
IEEE/ACM Trans Comput Biol Bioinform. 2010 Jan-Mar;7(1):153-65. doi: 10.1109/TCBB.2008.34.
Although most biclustering formulations are NP-hard, in time series expression data analysis, it is reasonable to restrict the problem to the identification of maximal biclusters with contiguous columns, which correspond to coherent expression patterns shared by a group of genes in consecutive time points. This restriction leads to a tractable problem. We propose an algorithm that finds and reports all maximal contiguous column coherent biclusters in time linear in the size of the expression matrix. The linear time complexity of CCC-Biclustering relies on the use of a discretized matrix and efficient string processing techniques based on suffix trees. We also propose a method for ranking biclusters based on their statistical significance and a methodology for filtering highly overlapping and, therefore, redundant biclusters. We report results in synthetic and real data showing the effectiveness of the approach and its relevance in the discovery of regulatory modules. Results obtained using the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress show not only the ability of the proposed methodology to extract relevant information compatible with documented biological knowledge but also the utility of using this algorithm in the study of other environmental stresses and of regulatory modules in general.
虽然大多数双聚类公式都是 NP 难的,但在时间序列表达数据分析中,将问题限制为识别具有连续列的最大双聚类是合理的,这对应于一组基因在连续时间点上共享的连贯表达模式。这种限制导致了一个可解的问题。我们提出了一种算法,该算法可以在线性时间内找到并报告所有最大的连续列相干双聚类,其大小与表达矩阵的大小成正比。CCC-Biclustering 的线性时间复杂度依赖于离散化矩阵和基于后缀树的高效字符串处理技术的使用。我们还提出了一种基于统计显著性对双聚类进行排序的方法,以及一种过滤高度重叠和因此冗余的双聚类的方法。我们在合成数据和真实数据中报告了结果,展示了该方法的有效性及其在发现调控模块方面的相关性。使用酿酒酵母在热应激下发生的转录组表达模式获得的结果不仅表明了所提出的方法能够提取与已有生物学知识兼容的相关信息的能力,还表明了在研究其他环境应激和一般调控模块时使用该算法的实用性。