Tsiporkova Elena, Boeva Veselka
Department of Molecular Genetics, Ghent University, Technologiepark 927, 9052 Ghent, Belgium.
J Bioinform Comput Biol. 2007 Oct;5(5):1005-22. doi: 10.1142/s0219720007003053.
Gene expression microarray experiments frequently generate datasets with multiple values missing. However, most of the analysis, mining, and classification methods for gene expression data require a complete matrix of gene array values. Therefore, the accurate estimation of missing values in such datasets has been recognized as an important issue, and several imputation algorithms have already been proposed to the biological community. Most of these approaches, however, are not particularly suitable for time series expression profiles. In view of this, we propose a novel imputation algorithm, which is specially suited for the estimation of missing values in gene expression time series data. The algorithm utilizes Dynamic Time Warping (DTW) distance in order to measure the similarity between time expression profiles, and subsequently selects for each gene expression profile with missing values a dedicated set of candidate profiles for estimation. Three different DTW-based imputation (DTWimpute) algorithms have been considered: position-wise, neighborhood-wise, and two-pass imputation. These have initially been prototyped in Perl, and their accuracy has been evaluated on yeast expression time series data using several different parameter settings. The experiments have shown that the two-pass algorithm consistently outperforms, in particular for datasets with a higher level of missing entries, the neighborhood-wise and the position-wise algorithms. The performance of the two-pass DTWimpute algorithm has further been benchmarked against the weighted K-Nearest Neighbors algorithm, which is widely used in the biological community; the former algorithm has appeared superior to the latter one. Motivated by these findings, indicating clearly the added value of the DTW techniques for missing value estimation in time series data, we have built an optimized C++ implementation of the two-pass DTWimpute algorithm. The software also provides for a choice between three different initial rough imputation methods.
基因表达微阵列实验经常生成存在多个缺失值的数据集。然而,大多数用于基因表达数据的分析、挖掘和分类方法都需要一个完整的基因阵列值矩阵。因此,准确估计此类数据集中的缺失值已被视为一个重要问题,并且已经向生物界提出了几种插补算法。然而,这些方法中的大多数并不特别适用于时间序列表达谱。鉴于此,我们提出了一种新颖的插补算法,它特别适用于估计基因表达时间序列数据中的缺失值。该算法利用动态时间规整(DTW)距离来测量时间表达谱之间的相似性,随后为每个具有缺失值的基因表达谱选择一组专用的候选谱进行估计。已考虑三种不同的基于DTW的插补(DTWimpute)算法:逐位置插补、邻域插补和两遍插补。这些算法最初用Perl编写了原型,并使用几种不同的参数设置在酵母表达时间序列数据上评估了它们的准确性。实验表明,两遍插补算法始终表现更优,特别是对于具有较高缺失条目的数据集,优于邻域插补算法和逐位置插补算法。两遍DTWimpute算法的性能还与生物界广泛使用的加权K近邻算法进行了基准测试;结果表明前一种算法优于后一种算法。鉴于这些发现清楚地表明了DTW技术在时间序列数据缺失值估计方面的附加价值,我们构建了两遍DTWimpute算法的优化C++实现。该软件还提供了三种不同的初始粗略插补方法供选择。