Tuikkala Johannes, Elo Laura, Nevalainen Olli S, Aittokallio Tero
Department of Information Technology, University of Turku, Lemminkäisenkatu 14A, FIN-20520, Finland.
Bioinformatics. 2006 Mar 1;22(5):566-72. doi: 10.1093/bioinformatics/btk019. Epub 2005 Dec 23.
Gene expression microarray experiments produce datasets that frequently contain missing expression values. Accurate estimation of missing values is an important prerequisite for efficient data analysis, because many statistical and machine learning techniques either require a complete dataset or produce results that depend strongly on the quality of the estimates. A limitation of existing estimation methods for microarray data is that they rely solely on the expression data and use no external information. We hypothesized that a priori information on functional similarities, available from public databases, facilitates missing value estimation.
We investigated whether semantic similarity derived from Gene Ontology (GO) annotations could improve the selection of relevant genes for missing value estimation. The relative contribution of each information source was estimated automatically from the data using an adaptive weight selection procedure. Our experimental results on yeast cDNA microarray datasets indicated that incorporating GO information into the k-nearest neighbor algorithm enhances its performance considerably, especially when the number of experimental conditions is small and the percentage of missing values is high. The performance gain was less evident with a more sophisticated estimation method. We conclude that even a small proportion of annotated genes can provide improvements in data quality that are significant for the eventual interpretation of microarray experiments.
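The idea of combining expression similarity with GO-based semantic similarity in a k-nearest neighbor estimator can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the two information sources are combined, so the convex blend in `combined_distance`, the inverse-distance averaging, the leave-one-out grid search in `select_weight`, and all function and variable names are assumptions made for illustration. The sketch also assumes GO similarities are already computed and scaled to [0, 1], and that expression values are on a comparable scale.

```python
import math

def expr_distance(a, b, skip):
    """RMS difference over conditions observed in both profiles, ignoring column `skip`."""
    pairs = [(x, y) for j, (x, y) in enumerate(zip(a, b))
             if j != skip and x is not None and y is not None]
    if not pairs:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(pairs))

def combined_distance(d_expr, go_sim, weight):
    """Assumed blend: (1 - weight) * expression distance + weight * GO dissimilarity.
    Pairs without a GO similarity fall back to the expression distance alone."""
    if go_sim is None:
        return d_expr
    return (1.0 - weight) * d_expr + weight * (1.0 - go_sim)

def knn_impute_value(data, go_sim, gene, cond, k, weight):
    """Estimate data[gene][cond] from the k nearest genes observed at `cond`."""
    scored = []
    for other, profile in enumerate(data):
        if other == gene or profile[cond] is None:
            continue
        d_expr = expr_distance(data[gene], profile, skip=cond)
        if math.isinf(d_expr):
            continue  # no shared observed conditions, cannot compare
        sim = go_sim.get((min(gene, other), max(gene, other)))
        scored.append((combined_distance(d_expr, sim, weight), profile[cond]))
    scored.sort(key=lambda t: t[0])
    nearest = scored[:k]
    if not nearest:
        return None
    # Inverse-distance weighted average of the neighbors' values
    inv = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * v for w, (_, v) in zip(inv, nearest)) / sum(inv)

def select_weight(data, go_sim, k, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Pick the GO weight with the lowest leave-one-out RMSE on observed entries."""
    observed = [(i, j) for i, row in enumerate(data)
                for j, v in enumerate(row) if v is not None]
    best_w, best_err = grid[0], math.inf
    for w in grid:
        sq, n = 0.0, 0
        for i, j in observed:
            # The true value at (i, j) is excluded from the distance via `skip`
            est = knn_impute_value(data, go_sim, i, j, k, w)
            if est is not None:
                sq += (est - data[i][j]) ** 2
                n += 1
        err = math.sqrt(sq / n) if n else math.inf
        if err < best_err:
            best_w, best_err = w, err
    return best_w

def impute(data, go_sim, k=2):
    """Fill every missing entry (None) using the data-selected GO weight."""
    weight = select_weight(data, go_sim, k)
    filled = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, v in enumerate(row):
            if v is None:
                filled[i][j] = knn_impute_value(data, go_sim, i, j, k, weight)
    return filled, weight

# Toy example: rows are genes, columns are conditions, None marks a missing value.
expression = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 7.0],
    [1.0, 2.0, 3.2],
]
# Hypothetical GO similarities in [0, 1], keyed by (smaller index, larger index).
go_similarity = {(0, 1): 0.9, (1, 3): 0.8, (1, 2): 0.1}
filled, weight = impute(expression, go_similarity, k=2)
```

Setting the weight to 0 reduces the sketch to plain expression-based k-nearest neighbor imputation, which makes it straightforward to compare estimates with and without the GO term on the same data.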
Java and Matlab code is available from the authors on request.
Available online at http://users.utu.fi/jotatu/GOImpute.html.