Bar-Joseph Ziv, Gerber Georg, Simon Itamar, Gifford David K, Jaakkola Tommi S
Laboratory for Computer Science, Massachusetts Institute of Technology, 200 Technology Square, Cambridge, MA 02139, USA.
Proc Natl Acad Sci U S A. 2003 Sep 2;100(18):10146-51. doi: 10.1073/pnas.1732547100. Epub 2003 Aug 21.
We present a general algorithm to detect genes differentially expressed between two nonhomogeneous time-series data sets. As increasing amounts of high-throughput biological data become available, a major challenge in genomic and computational biology is to develop methods for comparing data from different experimental sources. Time-series whole-genome expression data are a particularly valuable source of information because they can describe an unfolding biological process such as the cell cycle or immune response. However, comparisons of time-series expression data sets are hindered by biological and experimental inconsistencies such as differences in sampling rate, variations in the timing of biological processes, and the lack of repeats. Our algorithm overcomes these difficulties by using a continuous representation for time-series data and combining a noise model for individual samples with a global difference measure. We introduce a corresponding statistical method for computing the significance of this differential expression measure. We used our algorithm to compare cell-cycle-dependent gene expression in wild-type and knockout yeast strains. Our algorithm identified a set of 56 differentially expressed genes, and these results were validated by using independent protein-DNA-binding data. Unlike previous methods, our algorithm was also able to identify 22 non-cell-cycle-regulated genes as differentially expressed. This set of genes is significantly correlated in a set of independent expression experiments, suggesting additional roles for the transcription factors Fkh1 and Fkh2 in controlling cellular activity in yeast.
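As a rough illustration of the idea of comparing two differently sampled time series through a continuous representation, the sketch below may help. It is not the authors' algorithm: it swaps in simple piecewise-linear interpolation for the continuous representation, uses a mean squared difference over the shared time range as the global difference measure, and omits both the noise model and the significance computation. The function names `interpolate` and `global_difference` and the `steps` parameter are hypothetical, chosen only for this example.

```python
from bisect import bisect_right


def interpolate(times, values, t):
    """Piecewise-linear value of a sampled curve at time t.

    Assumes `times` is strictly increasing and has at least two points.
    This is a stand-in for a smoother continuous representation.
    """
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the sampled range")
    # Index of the right-hand endpoint of the segment that contains t.
    i = min(bisect_right(times, t), len(times) - 1)
    t0, t1 = times[i - 1], times[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def global_difference(times_a, values_a, times_b, values_b, steps=1000):
    """Mean squared difference between two curves over their shared time range.

    The integral of the squared difference is approximated with the
    trapezoidal rule on `steps` equal intervals, then divided by the
    length of the shared range.
    """
    start = max(times_a[0], times_b[0])
    end = min(times_a[-1], times_b[-1])
    if end <= start:
        raise ValueError("the two series do not overlap in time")
    h = (end - start) / steps
    total = 0.0
    for k in range(steps + 1):
        # Clamp to guard against floating-point overshoot at the last step.
        t = min(start + k * h, end)
        d = interpolate(times_a, values_a, t) - interpolate(times_b, values_b, t)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * d * d
    return total * h / (end - start)


# Hypothetical expression profiles for one gene, sampled at different rates.
wild_type_times = [0, 10, 20, 30, 40]
wild_type_values = [0.0, 1.0, 2.0, 3.0, 4.0]
knockout_times = [0, 20, 40]
knockout_values = [1.0, 3.0, 5.0]

score = global_difference(
    wild_type_times, wild_type_values, knockout_times, knockout_values
)
```

Because both curves are evaluated on a common grid, the two experiments do not need to share sampling times. A score like this would still need a noise model and a significance test before any gene could be called differentially expressed.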