Madeira Sara C, Oliveira Arlindo L
Knowledge Discovery and Bioinformatics (KDBIO) group, INESC-ID, Lisbon, Portugal.
Algorithms Mol Biol. 2009 Jun 4;4:8. doi: 10.1186/1748-7188-4-8.
The ability to monitor the change in expression patterns over time, and to observe the emergence of coherent temporal responses using gene expression time series, obtained from microarray experiments, is critical to advance our understanding of complex biological processes. In this context, biclustering algorithms have been recognized as an important tool for the discovery of local expression patterns, which are crucial to unravel potential regulatory mechanisms. Although most formulations of the biclustering problem are NP-hard, when working with time series expression data the interesting biclusters can be restricted to those with contiguous columns. This restriction leads to a tractable problem and enables the design of efficient biclustering algorithms able to identify all maximal contiguous column coherent biclusters.
In this work, we propose e-CCC-Biclustering, a biclustering algorithm that finds and reports all maximal contiguous column coherent biclusters with approximate expression patterns in time polynomial in the size of the time series gene expression matrix. This polynomial time complexity is achieved by manipulating a discretized version of the original matrix using efficient string processing techniques. We also propose extensions to deal with missing values, discover anticorrelated and scaled expression patterns, and different ways to compute the errors allowed in the expression patterns. We propose a scoring criterion combining the statistical significance of expression patterns with a similarity measure between overlapping biclusters.
We present results in real data showing the effectiveness of e-CCC-Biclustering and its relevance in the discovery of regulatory modules describing the transcriptomic expression patterns occurring in Saccharomyces cerevisiae in response to heat stress. In particular, the results show the advantage of considering approximate patterns when compared to state of the art methods that require exact matching of gene expression time series.
The identification of co-regulated genes, involved in specific biological processes, remains one of the main avenues open to researchers studying gene regulatory networks. The ability of the proposed methodology to efficiently identify sets of genes with similar expression patterns is shown to be instrumental in the discovery of relevant biological phenomena, leading to more convincing evidence of specific regulatory mechanisms.
A prototype implementation of the algorithm coded in Java together with the dataset and examples used in the paper is available in http://kdbio.inesc-id.pt/software/e-ccc-biclustering.
通过微阵列实验获得基因表达时间序列,监测表达模式随时间的变化,并观察连贯的时间响应的出现,对于深化我们对复杂生物过程的理解至关重要。在此背景下,双聚类算法已被视为发现局部表达模式的重要工具,而这些局部表达模式对于揭示潜在的调控机制至关重要。尽管双聚类问题的大多数公式都是NP难问题,但在处理时间序列表达数据时,有趣的双聚类可以限制为列连续的双聚类。这种限制导致了一个可处理的问题,并使得能够设计出高效的双聚类算法,能够识别所有最大的列连续连贯双聚类。
在这项工作中,我们提出了e-CCC-双聚类算法,这是一种双聚类算法,它能在时间序列基因表达矩阵大小的时间多项式内找到并报告所有具有近似表达模式的最大列连续连贯双聚类。这种多项式时间复杂度是通过使用高效的字符串处理技术处理原始矩阵的离散化版本来实现的。我们还提出了扩展方法来处理缺失值、发现反相关和缩放后的表达模式,以及计算表达模式中允许的误差的不同方法。我们提出了一种评分标准,将表达模式的统计显著性与重叠双聚类之间的相似性度量相结合。
我们在真实数据中展示了e-CCC-双聚类算法的有效性及其在发现描述酿酒酵母热应激转录组表达模式的调控模块中的相关性。特别是,结果表明与需要精确匹配基因表达时间序列的现有方法相比,考虑近似模式的优势。
识别参与特定生物过程的共调控基因仍然是研究基因调控网络的研究人员的主要途径之一。所提出的方法有效识别具有相似表达模式的基因集的能力在发现相关生物现象中发挥了作用,为特定调控机制提供了更有说服力的证据。
该算法的Java原型实现以及论文中使用的数据集和示例可在http://kdbio.inesc-id.pt/software/e-ccc-biclustering获得。