Biological impact of missing-value imputation on downstream analyses of gene expression profiles.

Affiliation

Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA.

Publication information

Bioinformatics. 2011 Jan 1;27(1):78-86. doi: 10.1093/bioinformatics/btq613. Epub 2010 Nov 2.

DOI:10.1093/bioinformatics/btq613

PMID:21045072

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3008641/

Abstract

MOTIVATION

Microarray experiments frequently produce multiple missing values (MVs) due to flaws such as dust, scratches, insufficient resolution or hybridization errors on the chips. Unfortunately, many downstream algorithms require a complete data matrix. The motivation of this work is to determine the impact of MV imputation on downstream analysis, and whether ranking of imputation methods by imputation accuracy correlates well with the biological impact of the imputation.

METHODS

Using eight datasets for differential expression (DE) and classification analysis and eight datasets for gene clustering, we demonstrate the biological impact of missing-value imputation on statistical downstream analyses, including three commonly employed DE methods, four classifiers and three gene-clustering methods. Correlation between the rankings of imputation methods based on three root-mean squared error (RMSE) measures and the rankings based on the downstream analysis methods was used to investigate which RMSE measure was most consistent with the biological impact measures, and which downstream analysis methods were the most sensitive to the choice of imputation procedure.
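As a rough illustration of how imputation accuracy is commonly scored, the sketch below hides known entries of a complete matrix, imputes them, and computes the RMSE at the hidden positions. It is a minimal sketch under stated assumptions, not the paper's procedure: the helper names (`mask_entries`, `row_mean_impute`, `rmse_at`) are hypothetical, row-mean imputation stands in for the methods the study actually compared, and the exact definitions of the three RMSE measures are not given in the abstract.

```python
import math
import random


def mask_entries(matrix, fraction, seed=0):
    """Return a copy with roughly `fraction` of entries set to None,
    together with the list of masked (row, column) positions."""
    rng = random.Random(seed)
    positions = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    masked = rng.sample(positions, max(1, int(fraction * len(positions))))
    out = [list(row) for row in matrix]
    for i, j in masked:
        out[i][j] = None
    return out, masked


def row_mean_impute(matrix):
    """Fill each None with the mean of the observed values in the same row (gene).
    A deliberately simple stand-in imputation method (assumption)."""
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled


def rmse_at(truth, imputed, positions):
    """Root mean squared error between truth and imputed values,
    restricted to the masked positions."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in positions]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Toy expression matrix (genes x samples); the values are made up.
    truth = [[1.0, 2.0, 3.0, 4.0], [2.0, 2.5, 3.0, 3.5], [5.0, 5.0, 5.0, 5.0]]
    incomplete, hidden = mask_entries(truth, fraction=0.25, seed=1)
    print(rmse_at(truth, row_mean_impute(incomplete), hidden))
```

Repeating this for several imputation methods gives one accuracy score per method, from which a ranking can be formed.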

RESULTS

DE was the most sensitive to the choice of imputation procedure, while classification was the least sensitive and clustering was intermediate between the two. The logged RMSE (LRMSE) measure had the highest correlation with the imputation rankings based on the DE results, indicating that the LRMSE is the best representative surrogate among the three RMSE-based measures. Bayesian principal component analysis and least squares adaptive appeared to be the best performing methods in the empirical downstream evaluation.
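The idea of choosing a surrogate accuracy measure by how well its ranking of imputation methods agrees with a downstream ranking can be sketched with a Spearman rank correlation. This is only an illustration: the abstract does not say which correlation statistic the authors used, the function names are hypothetical, and every score below is an invented placeholder rather than a result from the paper.

```python
import math


def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks.
    Assumes neither input is constant (otherwise the denominator is zero)."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Placeholder scores for four hypothetical imputation methods (lower = better).
downstream_loss = [0.10, 0.20, 0.30, 0.40]
accuracy_measures = {
    "measure_a": [1.0, 2.0, 3.0, 4.0],
    "measure_b": [4.0, 3.0, 2.0, 1.0],
    "measure_c": [1.0, 3.0, 2.0, 4.0],
}

# The candidate surrogate is the measure whose ranking agrees best
# with the ranking induced by the downstream analysis.
best_surrogate = max(
    accuracy_measures,
    key=lambda name: spearman(accuracy_measures[name], downstream_loss),
)

if __name__ == "__main__":
    for name, scores in accuracy_measures.items():
        print(name, round(spearman(scores, downstream_loss), 3))
    print("best surrogate:", best_surrogate)
```

Working on ranks rather than raw scores keeps the comparison insensitive to the different scales of the accuracy and downstream measures.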

