Hu Jianjun, Li Haifeng, Waterman Michael S, Zhou Xianghong Jasmine
Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.
BMC Bioinformatics. 2006 Oct 12;7:449. doi: 10.1186/1471-2105-7-449.
Missing value estimation is an important preprocessing step in microarray analysis. Although several methods have been developed to solve this problem, their performance is unsatisfactory for datasets with high rates of missing data, high measurement noise, or limited numbers of samples. In fact, more than 80% of the time-series datasets in the Stanford Microarray Database contain fewer than eight samples.
We present the integrative Missing Value Estimation method (iMISS), which incorporates information from multiple reference microarray datasets to improve missing value estimation. For each gene with missing data, we derive a consistent neighbor-gene list by taking the reference datasets into consideration. To determine whether the given reference datasets are sufficiently informative for integration, we use a submatrix imputation approach. Our experiments showed that iMISS significantly and consistently improves the accuracy of the state-of-the-art Local Least Squares (LLS) imputation algorithm, by up to 15% in our benchmark tests.
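The idea of a neighbor-gene list that is consistent across datasets, followed by a local least-squares fit, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`rank_distances`, `consensus_neighbors`, `lls_impute`) are hypothetical, genes are assumed to occupy the same row index in every dataset, Euclidean distance is assumed as the similarity measure, and a simple mean of per-dataset ranks stands in for the order-statistics-based aggregation mentioned in the abstract.

```python
import numpy as np

def rank_distances(data, gene, cols):
    """Rank every gene (0 = closest) by Euclidean distance to `gene` over `cols`."""
    diff = data[:, cols] - data[gene, cols]
    dist = np.sqrt(np.nansum(diff ** 2, axis=1))
    dist[gene] = np.inf  # the gene is never its own neighbor
    return np.argsort(np.argsort(dist, kind="stable"), kind="stable")

def consensus_neighbors(target, references, gene, k):
    """Pick k neighbors by mean rank across the target and reference datasets.

    Assumption: mean-rank aggregation is a stand-in for the paper's
    order-statistics-based combination, which the abstract does not detail.
    """
    observed = ~np.isnan(target[gene])
    ranks = [rank_distances(target, gene, observed)]
    for ref in references:
        all_cols = np.ones(ref.shape[1], dtype=bool)
        ranks.append(rank_distances(ref, gene, all_cols))
    mean_rank = np.mean(ranks, axis=0)
    # Only genes with complete rows in the target can serve as predictors.
    incomplete = np.isnan(target).any(axis=1)
    mean_rank[incomplete] = np.inf
    mean_rank[gene] = np.inf
    return np.argsort(mean_rank, kind="stable")[:k]

def lls_impute(target, gene, neighbors):
    """Fill missing entries of `gene` by least-squares regression on its neighbors."""
    observed = ~np.isnan(target[gene])
    nbr = target[neighbors]                 # shape (k, n_samples)
    A = nbr[:, observed].T                  # observed samples x k neighbors
    b = target[gene, observed]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    filled = target[gene].copy()
    filled[~observed] = nbr[:, ~observed].T @ coef
    return filled

# Toy data: gene 0 is exactly twice gene 1 and has one missing value.
target = np.array([[2.0, 4.0, np.nan, 8.0],
                   [1.0, 2.0, 3.0, 4.0],
                   [5.0, -1.0, 0.0, 2.0]])
reference = np.array([[1.0, 2.0, 3.0],
                      [1.1, 2.1, 3.1],
                      [9.0, 0.0, -4.0]])

neighbors = consensus_neighbors(target, [reference], gene=0, k=1)
filled = lls_impute(target, 0, neighbors)
```

In this toy example the reference dataset agrees with the target that gene 1 is the closest neighbor of gene 0, so the regression recovers the missing entry from gene 1 alone. With real data, the choice of `k` and of which reference datasets to include would need validation, which is the role the abstract assigns to the submatrix imputation step.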
We demonstrated that order-statistics-based integrative imputation algorithms can achieve significant improvements over state-of-the-art missing value estimation approaches such as LLS and are especially effective for imputing microarray datasets with a limited number of samples, high rates of missing data, or very noisy measurements. With the rapid accumulation of microarray datasets, the performance of our approach can be further improved by incorporating larger and more appropriate reference datasets.