Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.
J Biomed Inform. 2010 Aug;43(4):550-9. doi: 10.1016/j.jbi.2009.12.006. Epub 2010 Jan 18.
Unraveling the temporal complexity of cellular systems is a challenging task, as the subtle coordination of molecular activities cannot be adequately captured by simple mathematical concepts such as correlation. This paper addresses the challenge with a data-mining approach. We introduce the novel concept of a "frequent temporal association pattern" (FTAP): a set of genes that simultaneously and recurrently exhibit complex temporal expression patterns across multiple microarray datasets. Such temporal signals are hard to identify in individual microarray datasets, but become significant through their frequent occurrence across multiple datasets. We designed an efficient two-stage algorithm to identify FTAPs. First, for each gene, we identify expression trends that occur frequently across multiple datasets. Second, we look for sets of genes that simultaneously exhibit their respective trends recurrently in multiple datasets. We applied this algorithm to 18 yeast time-series microarray datasets. The majority of FTAPs identified by the algorithm are associated with specific biological functions. Moreover, a significant number of patterns include genes that are functionally related but do not exhibit co-expression; such gene groups cannot be captured by clustering algorithms. Our approach offers three advantages: (1) it can identify complex associations of temporal trends in gene expression, an important step towards understanding the complex mechanisms governing cellular systems; (2) it is capable of integrating time-series data with different time scales and intervals; and (3) it yields results that are robust against outliers.
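The two-stage idea described above can be illustrated with a minimal Python sketch. This is not the authors' algorithm; it is a toy under stated assumptions. The U/D/F trend encoding, the `eps` threshold, the fixed trend `length`, the brute-force enumeration, and all function names are hypothetical choices made for illustration. The sketch also simplifies "simultaneously" to "within the same dataset" and does not align time windows.

```python
from itertools import combinations, groupby


def trend_string(series, eps=0.5):
    """Encode a series as U (up), D (down), F (flat) steps, collapsing repeats."""
    steps = []
    for prev, cur in zip(series, series[1:]):
        diff = cur - prev
        steps.append("U" if diff > eps else "D" if diff < -eps else "F")
    # Assumption: collapsing runs makes the code insensitive to sampling density.
    return "".join(symbol for symbol, _ in groupby(steps))


def frequent_trends(datasets, gene, length, min_support):
    """Stage 1: trends of `length` symbols that one gene shows in >= min_support datasets."""
    support = {}
    for idx, data in enumerate(datasets):
        if gene not in data:
            continue
        encoded = trend_string(data[gene])
        seen = {encoded[i:i + length] for i in range(len(encoded) - length + 1)}
        for trend in seen:
            support.setdefault(trend, set()).add(idx)
    return {t: ids for t, ids in support.items() if len(ids) >= min_support}


def frequent_associations(datasets, genes, length=2, min_support=2, max_size=3):
    """Stage 2: gene sets whose respective trends recur in the same datasets."""
    items = {}
    for gene in genes:
        for trend, ids in frequent_trends(datasets, gene, length, min_support).items():
            items[(gene, trend)] = ids
    patterns = []
    for size in range(2, max_size + 1):
        for combo in combinations(sorted(items), size):
            if len({gene for gene, _ in combo}) < size:
                continue  # keep at most one trend per gene in a pattern
            shared = set.intersection(*(items[item] for item in combo))
            if len(shared) >= min_support:
                patterns.append((combo, sorted(shared)))
    return patterns


# Made-up toy data: each dataset maps a gene name to its expression time series.
toy = [
    {"g1": [0, 1, 2, 1], "g2": [2, 1, 0, 1]},
    {"g1": [0, 2, 0], "g2": [3, 1, 3]},
    {"g1": [0, 0.1, 0.2], "g2": [1, 1, 1]},
]
print(frequent_associations(toy, ["g1", "g2"]))
```

In the toy data, the first two datasets have different numbers of time points, yet both encode `g1` as `UD` and `g2` as `DU`, so the pair is reported with supporting datasets 0 and 1. The enumeration in stage 2 is exponential in the pattern size; a real implementation would presumably use a pruned frequent-itemset search instead.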