Treatment of missing values for multivariate statistical analysis of gel-based proteomics data.

Authors

Pedreschi Romina, Hertog Maarten L A T M, Carpentier Sebastien C, Lammertyn Jeroen, Robben Johan, Noben Jean-Paul, Panis Bart, Swennen Rony, Nicolaï Bart M

Affiliation

BIOSYST-MeBioS Division, Katholieke Universiteit Leuven, Leuven, Belgium.

Publication information

Proteomics. 2008 Apr;8(7):1371-83. doi: 10.1002/pmic.200700975.

DOI:10.1002/pmic.200700975

PMID:18383008

Abstract

The presence of missing values in gel-based proteomics data represents a real challenge if an objective statistical analysis is pursued. Different methods for handling missing values were evaluated, and their influence on the selection of important proteins through multivariate techniques is discussed. The evaluated methods either dealt with missing values directly during the multivariate analysis, using the nonlinear estimation by iterative partial least squares (NIPALS) algorithm, or imputed them with k-nearest neighbor or Bayesian principal component analysis (BPCA) before the multivariate analysis was carried out. These techniques were applied to data obtained from gels stained with classical postrunning dyes and from DIGE gels. Before the multivariate techniques were applied, the normality and homoscedasticity assumptions on which parametric tests are based were tested in order to perform a sound statistical analysis. Of the three methods tested for handling missing values in our datasets, BPCA imputation proved to be the most consistent.
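To make the k-nearest neighbor option concrete, the following is a minimal sketch of the general idea, not the procedure used in the paper. It assumes a matrix with one row per protein spot and one column per gel, marks missing values with `None`, measures distance only over the columns observed in both rows, and fills each gap with the unweighted mean of the `k` nearest rows. The function name `knn_impute`, the data layout, and the example numbers are all hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Return a copy of rows with None entries replaced by a k-NN estimate.

    rows: list of equal-length lists (assumed: one row per spot, one column per gel).
    """
    def distance(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows in which column j is observed
            # and that share at least one observed column with this row.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates = [c for c in candidates if math.isfinite(c[0])]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:
                # Unweighted mean of the neighbors' values in column j.
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Hypothetical spot volumes: the second spot is missing on the third gel.
spots = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, None],
    [5.0, 6.0, 7.0],
    [0.9, 1.9, 2.8],
]
filled = knn_impute(spots, k=2)
```

Distances are computed from the original matrix, so an imputed value never influences another imputation. A value stays `None` when no other row has that column observed.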
