Meuleman Wouter, Engwegen Judith YMN, Gast Marie-Christine W, Beijnen Jos H, Reinders Marcel JT, Wessels Lodewyk FA
Bioinformatics and Statistics, Department of Molecular Biology, The Netherlands Cancer Institute, Amsterdam, The Netherlands.
BMC Bioinformatics. 2008 Feb 7;9:88. doi: 10.1186/1471-2105-9-88.
Mass spectrometry for biological data analysis is an active field of research, providing an efficient means of high-throughput proteome screening. A popular variant of mass spectrometry is surface-enhanced laser desorption/ionisation (SELDI), which is often used to measure sample populations with the goal of developing (clinical) classifiers. Unfortunately, not only are the data resulting from such measurements quite noisy, but the variance between replicate measurements of the same sample can also be high. Normalisation of spectra can greatly reduce the effect of this technical variance and further improve the quality and interpretability of the data. However, it is unclear which normalisation method yields the most informative result.
In this paper, we describe the first systematic comparison of a wide range of normalisation methods, using two objectives that should be met by a good method. These objectives are minimisation of inter-spectra variance and maximisation of signal with respect to class separation. The former is assessed using an estimation of the coefficient of variation, the latter using the classification performance of three types of classifiers on real-world datasets representing two-class diagnostic problems. To obtain a maximally robust evaluation of a normalisation method, both objectives are evaluated over multiple datasets and multiple configurations of baseline correction and peak detection methods. Results are assessed for statistical significance and visualised to reveal the performance of each normalisation method, in particular with respect to using no normalisation. The normalisation methods described have been implemented in the freely available MASDA R-package.
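As a rough illustration of the first objective, the coefficient of variation (standard deviation divided by mean) can be computed per peak across replicate spectra and then averaged. The sketch below is not the MASDA implementation, and the paper's exact estimator is not given here. The function name `mean_cv` and the assumption that replicate spectra are already aligned on the same peak positions are hypothetical.

```python
from statistics import mean, stdev

def mean_cv(replicates):
    """Average per-peak coefficient of variation across replicate spectra.

    `replicates` is a list of spectra, each a list of peak intensities
    aligned on the same m/z positions (an assumption of this sketch).
    """
    cvs = []
    for intensities in zip(*replicates):   # one tuple per peak position
        m = mean(intensities)
        if m > 0:                          # skip peaks without signal
            cvs.append(stdev(intensities) / m)
    return mean(cvs)

# Two replicates that differ only by an overall intensity factor.
raw = [[10.0, 20.0, 30.0], [20.0, 40.0, 60.0]]
# Dividing each spectrum by its own mean removes that factor.
scaled = [[x / mean(s) for x in s] for s in raw]
```

In this toy case `mean_cv(scaled)` is lower than `mean_cv(raw)`, which is the kind of reduction in inter-spectra variance that the first objective rewards.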
In the general case, normalisation of mass spectra is beneficial to the quality of data. The majority of methods we compared performed significantly better than the case in which no normalisation was used. We have shown that normalisation methods that scale spectra by a factor based on the dispersion (e.g., standard deviation) of the data clearly outperform those where a factor based on the central location (e.g., mean) is used. Additional improvements in performance are obtained when these factors are estimated locally, using a sliding window within spectra, instead of globally, over full spectra. Notably, the method used by the majority of SELDI users belongs to the underperforming category: methods that use a globally estimated factor based on the central location of the data.
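The two distinctions drawn above (central location versus dispersion, global versus local estimation) can be made concrete with a minimal sketch. This is not the MASDA API. The function names, the `half_width` parameter, the choice of population standard deviation, and the fallback to a factor of 1.0 when an estimate is zero are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def scale_global(spectrum, estimator):
    """Divide every intensity by one factor estimated over the full spectrum."""
    factor = estimator(spectrum) or 1.0    # assumed fallback for a zero factor
    return [x / factor for x in spectrum]

def scale_local(spectrum, estimator, half_width):
    """Divide each intensity by a factor estimated in a sliding window
    of up to 2 * half_width + 1 points centred on that intensity."""
    n = len(spectrum)
    out = []
    for i, x in enumerate(spectrum):
        lo = max(0, i - half_width)         # window is truncated at the edges
        hi = min(n, i + half_width + 1)
        factor = estimator(spectrum[lo:hi]) or 1.0
        out.append(x / factor)
    return out

spectrum = [1.0, 2.0, 3.0, 4.0]
by_global_mean = scale_global(spectrum, mean)      # central location, global
by_global_sd = scale_global(spectrum, pstdev)      # dispersion, global
by_local_mean = scale_local(spectrum, mean, 1)     # central location, local
by_local_sd = scale_local(spectrum, pstdev, 1)     # dispersion, local
```

Passing the estimator as an argument keeps the four variants comparable: only the statistic and the region over which it is estimated change.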