Ameliorative missing value imputation for robust biological knowledge inference.

Author information

Sehgal Muhammad Shoaib B, Gondal Iqbal, Dooley Laurence S, Coppel Ross

Affiliation

Faculty of Information Technology, Monash University, Northways Road, Churchill, Vic. 3842, Australia.

Publication information

J Biomed Inform. 2008 Aug;41(4):499-514. doi: 10.1016/j.jbi.2007.10.005. Epub 2007 Dec 31.

Abstract

Gene expression data are widely used in post-genomic analyses. The data are usually acquired with microarrays, which can measure the expression of thousands of genes simultaneously. Expression data, however, contain a significant number of missing values, which can affect subsequent biological analysis. To minimize this impact, several imputation algorithms have been proposed, including Collateral Missing Value Estimation (CMVE), Bayesian Principal Component Analysis (BPCA), Least Square Impute (LSImpute), Local Least Square Impute (LLSImpute), and K-Nearest Neighbour (KNN). These algorithms, however, exploit either only the global or only the local correlation structure of the data, which can lead to higher estimation errors. This paper presents an Ameliorative Missing Value Imputation (AMVI) technique that exploits global/local and positive/negative correlations in a given dataset by automatically selecting the optimal number of predictor genes k, using a non-parametric wrapper method based on Monte Carlo simulations. AMVI has the CMVE strategy at its core because CMVE, by exploiting positive/negative correlations, has demonstrated improved performance compared with both low-variance methods such as BPCA and LLSImpute and high-variance methods such as KNN and ZeroImpute. The performance of AMVI is compared with CMVE, BPCA, LLSImpute, and KNN by randomly removing between 1% and 15% of the values in eight different ovarian cancer, breast cancer, and yeast datasets. Together with the standard normalized root mean square (NRMS) error metric, the True Positive (TP) rate of significant gene selection, the biological significance of the selected genes, and statistical significance test results are presented to investigate the impact of missing values on subsequent biological analysis. The enhanced performance of AMVI was demonstrated by its lower NRMS error, improved TP rate, biological significance of the selected genes, and statistical significance test results when compared with the aforementioned imputation methods across all the datasets. The results show that AMVI adapted to the latent correlation structure of the data and proved to be an effective and robust approach compared with the trial-and-error methodology for selecting k. The results confirmed that AMVI can be successfully applied to accurately impute missing values prior to any microarray data analysis.
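The idea of choosing k by Monte Carlo simulation, and of scoring imputation with an NRMS error, can be illustrated with a minimal sketch. This is not the authors' AMVI or CMVE implementation: a plain KNN row-average imputer stands in for the estimator, the NRMS error is assumed here to be the RMS difference normalized by the RMS of the true values, and all function names and parameters (`nrms_error`, `knn_impute`, `select_k_monte_carlo`, `n_trials`, `mask_frac`) are hypothetical. Known entries are hidden at random, each candidate k is used to re-estimate them, and the k with the lowest mean error is kept.

```python
import numpy as np

def nrms_error(true_vals, est_vals):
    """RMS of (estimate - truth) divided by RMS of truth (assumed normalization)."""
    true_vals = np.asarray(true_vals, dtype=float)
    est_vals = np.asarray(est_vals, dtype=float)
    return np.sqrt(np.mean((est_vals - true_vals) ** 2)) / np.sqrt(np.mean(true_vals ** 2))

def knn_impute(data, k):
    """Stand-in imputer: fill each NaN with the mean of the k most similar rows (genes)."""
    missing = np.isnan(data)
    filled = data.copy()
    for i in np.flatnonzero(missing.any(axis=1)):
        # Root-mean-square distance over columns observed in both rows.
        shared = ~missing & ~missing[i]
        sq = np.where(shared, (data - data[i]) ** 2, 0.0)
        counts = shared.sum(axis=1)
        dists = np.where(counts > 0,
                         np.sqrt(sq.sum(axis=1) / np.maximum(counts, 1)),
                         np.inf)
        dists[i] = np.inf  # a row is never its own neighbour
        for j in np.flatnonzero(missing[i]):
            # Only rows that actually have column j observed can contribute.
            cand = np.flatnonzero(~missing[:, j] & np.isfinite(dists))
            nearest = cand[np.argsort(dists[cand])[:k]]
            if nearest.size:  # otherwise the cell is left as NaN
                filled[i, j] = data[nearest, j].mean()
    return filled

def select_k_monte_carlo(data, k_candidates, n_trials=20, mask_frac=0.05, seed=0):
    """Pick the k with the lowest mean NRMS error over random hide-and-impute trials."""
    rng = np.random.default_rng(seed)
    known = np.argwhere(~np.isnan(data))
    n_mask = max(1, int(mask_frac * len(known)))
    errors = {k: [] for k in k_candidates}
    for _ in range(n_trials):
        # Hide a random subset of known entries; reuse the same mask for every k.
        pick = known[rng.choice(len(known), size=n_mask, replace=False)]
        rows, cols = pick[:, 0], pick[:, 1]
        trial = data.copy()
        trial[rows, cols] = np.nan
        for k in k_candidates:
            est = knn_impute(trial, k)
            errors[k].append(nrms_error(data[rows, cols], est[rows, cols]))
    mean_err = {k: float(np.mean(v)) for k, v in errors.items()}
    return min(mean_err, key=mean_err.get), mean_err

# Synthetic example: 5 groups of 8 correlated "genes" measured in 12 samples.
rng = np.random.default_rng(1)
base = rng.normal(size=(5, 12))
expr = np.repeat(base, 8, axis=0) + 0.1 * rng.normal(size=(40, 12))
best_k, errors = select_k_monte_carlo(expr, k_candidates=[1, 3, 5, 7, 15])
print(best_k, errors)
```

Reusing one random mask for all candidate values of k within a trial keeps the comparison paired, so differences in error reflect k rather than which entries happened to be hidden.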

