Yasui Yutaka, Pepe Margaret, Thompson Mary Lou, Adam Bao-Ling, Wright George L, Qu Yinsheng, Potter John D, Winget Marcy, Thornquist Mark, Feng Ziding
Cancer Prevention Research Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA 98109-1024, USA.
Biostatistics. 2003 Jul;4(3):449-63. doi: 10.1093/biostatistics/4.3.449.
With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of 'signature' protein profiles specific to each pathologic state (e.g. normal vs. cancer) or differential profiles between experimental conditions (e.g. treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data-analytic strategy for discovering protein biomarkers based on such high-dimensional mass spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique. Our data-analytic strategy takes properties of the SELDI mass spectrometer into account: the SELDI output of a specimen contains about 48,000 (x, y) points where x is the protein mass divided by the number of charges introduced by ionization and y is the protein intensity of the corresponding mass per charge value, x, in that specimen. Given high coefficients of variation and other characteristics of protein intensity measures (y values), we reduce the measures of protein intensities to a set of binary variables that indicate peaks in the y-axis direction in the nearest neighborhoods of each mass per charge point in the x-axis direction. We then account for a shifting (measurement error) problem of the x-axis in SELDI output. After this pre-analysis processing of data, we combine the binary predictors to generate classification rules for cancer, benign hyperplasia, and normal states of prostate. 
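The peak reduction and the x-axis shift adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the neighborhood half-width `half_width`, and the relative tolerance `rel_tol` are hypothetical, and the abstract does not give their values.

```python
def local_peak_indicators(y, half_width):
    """Return a 0/1 list: 1 where y[i] is the largest intensity among the
    points within +/- half_width positions (and the window is not flat)."""
    n = len(y)
    flags = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        window = y[lo:hi]
        is_peak = y[i] == max(window) and y[i] > min(window)
        flags.append(1 if is_peak else 0)
    return flags


def tolerate_shift(x, flags, rel_tol):
    """Assumed handling of x-axis measurement error: treat a peak as present
    at x[i] if any detected peak lies within rel_tol * x[i] of it."""
    peak_x = [xi for xi, f in zip(x, flags) if f]
    return [
        1 if any(abs(xi - p) <= rel_tol * xi for p in peak_x) else 0
        for xi in x
    ]


# Toy spectrum: mass-per-charge values (x) and intensities (y).
x = [1000.0 + i for i in range(10)]
y = [1, 3, 2, 1, 1, 1, 1, 5, 4, 1]

peaks = local_peak_indicators(y, half_width=1)
widened = tolerate_shift(x, peaks, rel_tol=0.0015)
print(peaks)
print(widened)
```

Reducing intensities to peak indicators discards the magnitude of y, which is the point when coefficients of variation are high; the widening step lets the same peak match across specimens whose x-axes are slightly misaligned.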
Our approach is to apply the boosting algorithm to select binary predictors and construct a summary classifier. We empirically evaluate the sensitivity and specificity of the resulting summary classifiers on a test dataset that is independent of the training dataset used to construct them. The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal. In the classification of cancer vs. benign hyperplasia, however, an appreciable proportion of the benign specimens were incorrectly classified as cancer. We discuss practical issues associated with our proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery.
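As an illustration of boosting over binary predictors with evaluation on held-out data, the sketch below uses discrete AdaBoost with single-feature stumps. The abstract does not say which boosting variant was used, so this choice, the function names, and the toy data are all assumptions rather than the authors' method.

```python
import math


def fit_adaboost(X, y, n_rounds):
    """X: rows of 0/1 predictors; y: labels in {-1, +1}.
    Returns a list of (feature_index, polarity, alpha) stumps."""
    n, p = len(X), len(X[0])
    w = [1.0 / n] * n
    model = []
    for _ in range(n_rounds):
        best = None
        for j in range(p):
            for polarity in (1, -1):
                # Stump predicts +polarity when feature j is 1, else -polarity.
                err = sum(
                    wi for wi, row, yi in zip(w, X, y)
                    if polarity * (2 * row[j] - 1) != yi
                )
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((j, polarity, alpha))
        # Up-weight misclassified specimens, then renormalize.
        w = [
            wi * math.exp(-alpha * yi * polarity * (2 * row[j] - 1))
            for wi, row, yi in zip(w, X, y)
        ]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * pol * (2 * row[j] - 1) for j, pol, a in model)
    return 1 if score >= 0 else -1


def sensitivity_specificity(model, X, y):
    """Fraction of +1 cases and of -1 cases classified correctly."""
    tp = sum(1 for r, yi in zip(X, y) if yi == 1 and predict(model, r) == 1)
    tn = sum(1 for r, yi in zip(X, y) if yi == -1 and predict(model, r) == -1)
    pos = sum(1 for yi in y if yi == 1)
    return tp / pos, tn / (len(y) - pos)


# Toy data: predictor 0 separates the two classes in the training set.
X_train = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y_train = [1, 1, -1, -1]
X_test = [[1, 1, 1], [0, 0, 0]]
y_test = [1, -1]

model = fit_adaboost(X_train, y_train, n_rounds=2)
print(sensitivity_specificity(model, X_test, y_test))
```

Because each round picks a single predictor, the fitted model also acts as a variable selector: the feature indices in `model` are the peak indicators that the classifier relies on.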