Bar-Joseph Ziv, Gerber Georg K, Gifford David K, Jaakkola Tommi S, Simon Itamar
MIT Laboratory for Computer Science, 200 Technology Square, Cambridge, MA 02139, USA.
J Comput Biol. 2003;10(3-4):341-56. doi: 10.1089/10665270360688057.
We present algorithms for time-series gene expression analysis that permit the principled estimation of unobserved time points, clustering, and dataset alignment. Each expression profile is modeled as a cubic spline (piecewise polynomial) that is estimated from the observed data, and every time point influences the overall smooth expression curve. We constrain the spline coefficients of genes in the same class to have similar expression patterns, while also allowing for gene-specific parameters. We show that unobserved time points can be reconstructed using our method with 10-15% less error than the previous best methods. Our clustering algorithm operates directly on the continuous representations of gene expression profiles, and we demonstrate that this is particularly effective when applied to nonuniformly sampled data. Our continuous alignment algorithm also avoids difficulties encountered by discrete approaches. In particular, our method allows for control of the number of degrees of freedom of the warp through the specification of parameterized functions, which helps to avoid overfitting. We demonstrate that our algorithm produces stable low-error alignments on real expression data and further show a specific application to yeast knock-out data that produces biologically meaningful results.
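The idea of representing an expression profile as a continuous cubic spline and reading off an unobserved time point can be illustrated with a minimal sketch. It is not the method of the paper: it fits a plain natural cubic spline to a single profile, with no class-level constraint on the coefficients and no noise model. The function name `natural_cubic_spline`, the sampling times, and the toy expression values (generated from a sine curve) are all assumptions made for illustration.

```python
import math
from bisect import bisect_right


def natural_cubic_spline(ts, ys):
    """Return a function evaluating the natural cubic spline through (ts, ys).

    Illustrative helper (hypothetical name): interpolates one profile only;
    it does not share information across genes as the paper's model does.
    """
    ts, ys = list(ts), list(ys)
    n = len(ts) - 1  # number of polynomial pieces
    if n < 1 or len(ys) != len(ts):
        raise ValueError("need at least two (time, value) pairs")
    h = [ts[i + 1] - ts[i] for i in range(n)]  # knot spacings, may be nonuniform
    if any(step <= 0 for step in h):
        raise ValueError("time points must be strictly increasing")

    # Right-hand side of the tridiagonal system for the quadratic coefficients.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3.0 * (ys[i + 1] - ys[i]) / h[i] - 3.0 * (ys[i] - ys[i - 1]) / h[i - 1]

    # Forward sweep; natural boundary conditions fix c[0] = c[n] = 0.
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        pivot = 2.0 * (ts[i + 1] - ts[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / pivot
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / pivot

    # Back substitution for the coefficients of each cubic piece.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(t):
        # Pick the piece containing t (the end pieces also serve outside the range).
        j = min(max(bisect_right(ts, t) - 1, 0), n - 1)
        dt = t - ts[j]
        return ys[j] + dt * (b[j] + dt * (c[j] + dt * d[j]))

    return evaluate


# Toy profile on nonuniform sampling times; values are synthetic, not real data.
times = [0.0, 1.0, 2.0, 4.0, 7.0, 10.0]
values = [math.sin(t / 3.0) for t in times]

# Hold out the measurement at t = 4 and estimate it from the remaining points.
held_out = 3
obs_t = times[:held_out] + times[held_out + 1:]
obs_y = values[:held_out] + values[held_out + 1:]
spline = natural_cubic_spline(obs_t, obs_y)
estimate = spline(times[held_out])
print("estimated:", estimate, "held-out value:", values[held_out])
```

Because the curve is continuous, it can be evaluated at any time, which is what makes estimation of missing points and comparison of nonuniformly sampled profiles possible without resampling onto a common grid.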