Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes.

Authors

Brock Guy N, Shaffer John R, Blakesley Richard E, Lotz Meredith J, Tseng George C

Affiliation

Department of Bioinformatics and Biostatistics, School of Public Health and Information Sciences, University of Louisville, Louisville, KY 40292, USA.

Publication

BMC Bioinformatics. 2008 Jan 10;9:12. doi: 10.1186/1471-2105-9-12.

DOI:10.1186/1471-2105-9-12

PMID:18186917

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2253514/

Abstract

BACKGROUND

Gene expression data frequently contain missing values; however, most downstream analyses for microarray experiments require complete data. In the literature, many methods have been proposed to estimate missing values using the correlation patterns within the gene expression matrix. Each method has its own advantages, but the specific conditions under which each method is preferred remain largely unclear. In this report we describe an extensive evaluation of eight current imputation methods on multiple types of microarray experiments, including time series, multiple exposures, and multiple exposures × time series data. We then introduce two complementary selection schemes for determining the most appropriate imputation method for any given data set.

RESULTS

We found that the optimal imputation algorithms (LSA, LLS, and BPCA) are all highly competitive with each other, and that no method is uniformly superior in all the data sets we examined. The success of each method can also depend on the underlying "complexity" of the expression data, where we take complexity to indicate the difficulty in mapping the gene expression matrix to a lower-dimensional subspace. We developed an entropy measure to quantify the complexity of expression matrices and found that, by incorporating this information, the entropy-based selection (EBS) scheme is useful for selecting an appropriate imputation algorithm. We further propose a simulation-based self-training selection (STS) scheme. This technique has been used previously for microarray data imputation, but for different purposes. The scheme selects the optimal or near-optimal method with high accuracy but at an increased computational cost.
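The abstract does not give the formula for the entropy measure. As a rough illustration only, the sketch below assumes that complexity is scored as the normalized Shannon entropy of the eigenvalue spectrum of the expression matrix's covariance: values near 0 mean a few components dominate (low complexity), and values near 1 mean the variance is spread evenly (high complexity). The function names and the `threshold` argument are hypothetical, no cutoff value is taken from the paper, and the eigenvalues are assumed to have been computed elsewhere.

```python
import math


def spectrum_entropy(eigenvalues):
    """Normalized Shannon entropy of a non-negative eigenvalue spectrum.

    Assumed form (not quoted from the paper):
        e = -sum(p_i * log(p_i)) / log(k),  p_i = lambda_i / sum(lambda)
    Returns a value in [0, 1].
    """
    k = len(eigenvalues)
    positive = [v for v in eigenvalues if v > 0]  # zero terms contribute nothing
    total = sum(positive)
    if k < 2 or total == 0:
        return 0.0
    entropy = -sum((v / total) * math.log(v / total) for v in positive)
    return entropy / math.log(k)


def suggest_family(entropy, threshold):
    """Map an entropy score to a method family.

    `threshold` is a hypothetical, user-chosen cutoff. The direction of the
    rule follows the abstract: global methods for lower complexity,
    neighbour-based methods for higher complexity.
    """
    if entropy < threshold:
        return "global (e.g. PLS, SVD, BPCA)"
    return "neighbour-based (e.g. KNN, OLS, LSA, LLS)"
```

For instance, `spectrum_entropy([5.0, 0.0, 0.0, 0.0])` scores a spectrum with a single dominant component as minimally complex, while a flat spectrum such as `[1.0, 1.0, 1.0, 1.0]` scores as maximally complex.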

CONCLUSION

Our findings provide insight into the problem of which imputation method is optimal for a given data set. Three top-performing methods (LSA, LLS and BPCA) are competitive with each other. Global-based imputation methods (PLS, SVD, BPCA) performed better on microarray data with lower complexity, while neighbour-based methods (KNN, OLS, LSA, LLS) performed better on data with higher complexity. We also found that the EBS and STS schemes serve as complementary and effective tools for selecting the optimal imputation algorithm.
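The simulation-based self-training idea behind STS can be sketched roughly as follows: hide some entries whose values are known, let each candidate method impute them, and prefer the method with the lowest error on the hidden entries. This is a minimal sketch under assumptions, not the authors' implementation. The two mean-based imputers are simple stand-ins for the eight methods evaluated in the paper, and `mask_fraction`, `n_rounds`, `seed`, and all function names are hypothetical. Missing values are represented as `None`.

```python
import math
import random


def row_mean_impute(matrix):
    """Stand-in imputer: fill each gap with the mean of its row (gene)."""
    filled = []
    for row in matrix:
        seen = [v for v in row if v is not None]
        mean = sum(seen) / len(seen) if seen else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def column_mean_impute(matrix):
    """Stand-in imputer: fill each gap with the mean of its column (array)."""
    transposed = [list(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*row_mean_impute(transposed))]


def self_training_select(matrix, imputers, mask_fraction=0.05, n_rounds=10, seed=0):
    """Pick the imputer with the lowest mean RMSE on artificially hidden entries."""
    rng = random.Random(seed)
    # Only entries that are actually observed can serve as ground truth.
    observed = [(i, j) for i, row in enumerate(matrix)
                for j, v in enumerate(row) if v is not None]
    n_hide = max(1, int(mask_fraction * len(observed)))
    scores = {name: 0.0 for name in imputers}
    for _ in range(n_rounds):
        hidden = rng.sample(observed, n_hide)
        masked = [row[:] for row in matrix]
        for i, j in hidden:
            masked[i][j] = None
        for name, impute in imputers.items():
            filled = impute(masked)
            sq_err = sum((filled[i][j] - matrix[i][j]) ** 2 for i, j in hidden)
            # Accumulate the average RMSE over all rounds.
            scores[name] += math.sqrt(sq_err / n_hide) / n_rounds
    best = min(scores, key=scores.get)
    return best, scores
```

Calling `self_training_select(data, {"row_mean": row_mean_impute, "column_mean": column_mean_impute})` returns the name of the best-scoring candidate and the per-method scores. Every candidate is rerun in every round, which is consistent with the increased computational cost noted in the Results.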

Similar articles

Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data.

Bioinformatics. 2005 May 15;21(10):2417-23. doi: 10.1093/bioinformatics/bti345. Epub 2005 Feb 24.

DNA microarray data imputation and significance analysis of differential expression.

Bioinformatics. 2005 Nov 15;21(22):4155-61. doi: 10.1093/bioinformatics/bti638. Epub 2005 Aug 23.

Ameliorative missing value imputation for robust biological knowledge inference.

J Biomed Inform. 2008 Aug;41(4):499-514. doi: 10.1016/j.jbi.2007.10.005. Epub 2007 Dec 31.

Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme.

BMC Bioinformatics. 2006 Jan 22;7:32. doi: 10.1186/1471-2105-7-32.

A hybrid imputation approach for microarray missing value estimation.

BMC Genomics. 2015;16 Suppl 9(Suppl 9):S1. doi: 10.1186/1471-2164-16-S9-S1. Epub 2015 Aug 17.

Robust imputation method for missing values in microarray data.

BMC Bioinformatics. 2007 May 3;8 Suppl 2(Suppl 2):S6. doi: 10.1186/1471-2105-8-S2-S6.

Missing value imputation in high-dimensional phenomic data: imputable or not, and how?

BMC Bioinformatics. 2014 Nov 5;15(1):346. doi: 10.1186/s12859-014-0346-6.

Two-pass imputation algorithm for missing value estimation in gene expression time series.

J Bioinform Comput Biol. 2007 Oct;5(5):1005-22. doi: 10.1142/s0219720007003053.

Dealing with gene expression missing data.

Syst Biol (Stevenage). 2006 May;153(3):105-19. doi: 10.1049/ip-syb:20050056.

Cited by

A multi-classification deep neural network for cancer type identification from high-dimension, small-sample and imbalanced gene microarray data.

Sci Rep. 2025 Feb 12;15(1):5239. doi: 10.1038/s41598-025-89475-2.

Big Data in Gastroenterology Research.

Int J Mol Sci. 2023 Jan 27;24(3):2458. doi: 10.3390/ijms24032458.

Computational approaches for predicting variant impact: An overview from resources, principles to applications.

Front Genet. 2022 Sep 29;13:981005. doi: 10.3389/fgene.2022.981005. eCollection 2022.

Development and Validation of a New Multiparametric Random Survival Forest Predictive Model for Breast Cancer Recurrence with a Potential Benefit to Individual Outcomes.

Cancer Manag Res. 2022 Mar 1;14:909-923. doi: 10.2147/CMAR.S346871. eCollection 2022.

Latent triple trajectories of substance use as predictors for the onset of antisocial personality disorder among urban African American and Puerto Rican adults: A 22-year longitudinal study.

Subst Abus. 2022;43(1):442-450. doi: 10.1080/08897077.2021.1946890.

An Ensemble Method for Missing Data of Environmental Sensor Considering Univariate and Multivariate Characteristics.

Sensors (Basel). 2021 Nov 16;21(22):7595. doi: 10.3390/s21227595.

Addressing missing values in routine health information system data: an evaluation of imputation methods using data from the Democratic Republic of the Congo during the COVID-19 pandemic.

Popul Health Metr. 2021 Nov 4;19(1):44. doi: 10.1186/s12963-021-00274-z.

A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes.

Nucleic Acids Res. 2020 Dec 2;48(21):e125. doi: 10.1093/nar/gkaa881.

A Review of Imputation Strategies for Isobaric Labeling-Based Shotgun Proteomics.

J Proteome Res. 2021 Jan 1;20(1):1-13. doi: 10.1021/acs.jproteome.0c00123. Epub 2020 Sep 25.

Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique.

Int J Mol Sci. 2018 Oct 30;19(11):3398. doi: 10.3390/ijms19113398.

References

Evaluation and comparison of gene clustering methods in microarray analysis.

Bioinformatics. 2006 Oct 1;22(19):2405-12. doi: 10.1093/bioinformatics/btl406. Epub 2006 Jul 31.

Improving missing value imputation of microarray data by using spot quality weights.

BMC Bioinformatics. 2006 Jun 16;7:306. doi: 10.1186/1471-2105-7-306.

Prediction of missing values in microarray and use of mixed models to evaluate the predictors.

Stat Appl Genet Mol Biol. 2005;4:Article10. doi: 10.2202/1544-6115.1120. Epub 2005 May 5.

Microarray missing data imputation based on a set theoretic framework and biological knowledge.

Nucleic Acids Res. 2006 Mar 20;34(5):1608-19. doi: 10.1093/nar/gkl047. Print 2006.

Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme.

BMC Bioinformatics. 2006 Jan 22;7:32. doi: 10.1186/1471-2105-7-32.

Improving missing value estimation in microarray data with gene ontology.

Bioinformatics. 2006 Mar 1;22(5):566-72. doi: 10.1093/bioinformatics/btk019. Epub 2005 Dec 23.

The influence of missing value imputation on detection of differentially expressed genes from microarray data.

Bioinformatics. 2005 Dec 1;21(23):4272-9. doi: 10.1093/bioinformatics/bti708. Epub 2005 Oct 10.

DNA microarray data imputation and significance analysis of differential expression.

Bioinformatics. 2005 Nov 15;21(22):4155-61. doi: 10.1093/bioinformatics/bti638. Epub 2005 Aug 23.

Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data.

Bioinformatics. 2005 May 15;21(10):2417-23. doi: 10.1093/bioinformatics/bti345. Epub 2005 Feb 24.

Missing value estimation for DNA microarray gene expression data: local least squares imputation.

Bioinformatics. 2005 Jan 15;21(2):187-98. doi: 10.1093/bioinformatics/bth499. Epub 2004 Aug 27.
