A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection.

Author information

Yasui Yutaka, Pepe Margaret, Thompson Mary Lou, Adam Bao-Ling, Wright George L, Qu Yinsheng, Potter John D, Winget Marcy, Thornquist Mark, Feng Ziding

Affiliation

Cancer Prevention Research Program, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N., Seattle, WA 98109-1024, USA.

Publication information

Biostatistics. 2003 Jul;4(3):449-63. doi: 10.1093/biostatistics/4.3.449.

DOI:10.1093/biostatistics/4.3.449

PMID:12925511

Abstract

With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of 'signature' protein profiles specific to each pathologic state (e.g. normal vs. cancer) or differential profiles between experimental conditions (e.g. treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data-analytic strategy for discovering protein biomarkers based on such high-dimensional mass spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique. Our data-analytic strategy takes properties of the SELDI mass spectrometer into account: the SELDI output of a specimen contains about 48,000 (x, y) points where x is the protein mass divided by the number of charges introduced by ionization and y is the protein intensity of the corresponding mass per charge value, x, in that specimen. Given high coefficients of variation and other characteristics of protein intensity measures (y values), we reduce the measures of protein intensities to a set of binary variables that indicate peaks in the y-axis direction in the nearest neighborhoods of each mass per charge point in the x-axis direction. We then account for a shifting (measurement error) problem of the x-axis in SELDI output. After this pre-analysis processing of data, we combine the binary predictors to generate classification rules for cancer, benign hyperplasia, and normal states of prostate. 
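The pre-analysis processing described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `peak_indicators` and `widen_for_shift`, the neighborhood size `half_window`, and the relative tolerance `rel_tol` are all hypothetical, since the abstract does not give the actual neighborhood width or the size of the x-axis shift.

```python
import numpy as np

def peak_indicators(y, half_window):
    """Return a 0/1 array marking local peaks in the intensity values y.

    Point i is flagged when y[i] is the maximum of its neighborhood
    (half_window points on each side, an assumed parameter) and the
    neighborhood is not completely flat.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    peaks = np.zeros(n, dtype=int)
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = y[lo:hi]
        if y[i] == window.max() and y[i] > window.min():
            peaks[i] = 1
    return peaks

def widen_for_shift(peaks, x, rel_tol):
    """Allow for x-axis measurement error when reading peak indicators.

    Point i is set to 1 if any detected peak lies within
    x[i] * (1 +/- rel_tol). rel_tol is an assumed relative tolerance.
    """
    x = np.asarray(x, dtype=float)
    peak_x = x[np.asarray(peaks) == 1]
    out = np.zeros(len(x), dtype=int)
    for i, xi in enumerate(x):
        if np.any(np.abs(peak_x - xi) <= rel_tol * xi):
            out[i] = 1
    return out

# Toy spectrum: seven mass-per-charge points with one clear peak.
x = [1000, 1001, 1002, 1003, 1004, 1005, 1006]
y = [1, 2, 5, 2, 1, 1, 1]
peaks = peak_indicators(y, half_window=2)
shifted = widen_for_shift(peaks, x, rel_tol=0.001)
```

Reducing intensities to presence or absence of a peak discards the noisy magnitude of y, which is the motivation the abstract gives (high coefficients of variation). The loops are written for readability; a real spectrum of about 48,000 points would likely call for a vectorized version.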
Our approach is to apply the boosting algorithm to select binary predictors and construct a summary classifier. We empirically evaluate sensitivity and specificity of the resulting summary classifiers with a test dataset that is independent from the training dataset used to construct the summary classifiers. The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal. In the classification of cancer vs. benign hyperplasia, however, an appreciable proportion of the benign specimens were classified incorrectly as cancer. We discuss practical issues associated with our proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery.
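The classification step can be illustrated with a minimal sketch for a single two-class comparison (for example, cancer vs. normal). It assumes discrete AdaBoost with one binary peak indicator per weak rule; the abstract says only "the boosting algorithm", so the specific variant, the function names, and the number of rounds here are assumptions rather than the authors' method.

```python
import numpy as np

def fit_adaboost(X, y, n_rounds):
    """Discrete AdaBoost where each weak rule is one binary predictor.

    X: (n_samples, n_features) array of 0/1 peak indicators.
    y: array of labels in {-1, +1}.
    Returns a list of (feature_index, polarity, alpha) tuples.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n = X.shape[0]
    w = np.full(n, 1.0 / n)   # observation weights, start uniform
    S = 2 * X - 1             # map 0/1 to -1/+1 so a column acts as a rule
    rules = []
    for _ in range(n_rounds):
        # Weighted error of each column used directly as a classifier.
        err = (w[:, None] * (S != y[:, None])).sum(axis=0)
        # A column with error above 0.5 is useful with its sign flipped.
        best_err = np.minimum(err, 1 - err)
        j = int(np.argmin(best_err))
        polarity = 1 if err[j] <= 0.5 else -1
        e = min(max(best_err[j], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * np.log((1 - e) / e)
        pred = polarity * S[:, j]
        # Up-weight misclassified observations, then renormalize.
        w = w * np.exp(-alpha * y * pred)
        w = w / w.sum()
        rules.append((j, polarity, alpha))
    return rules

def predict(rules, X):
    """Summary classifier: sign of the alpha-weighted vote of all rules."""
    S = 2 * np.asarray(X) - 1
    score = sum(a * pol * S[:, j] for j, pol, a in rules)
    return np.where(score >= 0, 1, -1)

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity; assumes both classes are present."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)
```

In use, `fit_adaboost` would be run on the training specimens only, and `sensitivity_specificity` would be computed from `predict` applied to the independent test specimens, mirroring the evaluation design described in the abstract. The features chosen across rounds also serve as the selected candidate peaks.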

Similar articles

Application of serum SELDI proteomic patterns in diagnosis of lung cancer.

BMC Cancer. 2005 Jul 20;5:83. doi: 10.1186/1471-2407-5-83.

Identification of lung cancer patients by serum protein profiling using surface-enhanced laser desorption/ionization time-of-flight mass spectrometry.

Am J Clin Oncol. 2008 Apr;31(2):133-9. doi: 10.1097/COC.0b013e318145b98b.

[Proteomic analysis of prostate cancer using surface enhanced laser desorption/ionization mass spectrometry].

Zhonghua Yi Xue Za Zhi. 2005 Nov 30;85(45):3172-5.

Simultaneous and exact interval estimates for the contrast of two groups based on an extremely high dimensional variable: application to mass spec data.

Bioinformatics. 2007 Jun 15;23(12):1451-8. doi: 10.1093/bioinformatics/btm130. Epub 2007 Apr 25.

A serum proteomic pattern for the detection of colorectal adenocarcinoma using surface enhanced laser desorption and ionization mass spectrometry.

Cancer Invest. 2006 Dec;24(8):747-53. doi: 10.1080/07357900601063873.

A robust meta-classification strategy for cancer detection from MS data.

Proteomics. 2006 Jan;6(2):592-604. doi: 10.1002/pmic.200500192.

Prostate cancer biomarker discovery using high performance mass spectral serum profiling.

Comput Methods Programs Biomed. 2009 Oct;96(1):33-41. doi: 10.1016/j.cmpb.2009.04.003. Epub 2009 May 6.

Proteomic data analysis workflow for discovery of candidate biomarker peaks predictive of clinical outcome for patients with acute myeloid leukemia.

J Proteome Res. 2008 Jun;7(6):2332-41. doi: 10.1021/pr070482e. Epub 2008 May 2.

[Detection and clinical significance of serum proteomic patterns of breast cancers by surface enhanced laser desorption/ionization time of flight mass spectrometry].

Zhonghua Zhong Liu Za Zhi. 2006 Mar;28(3):204-7.

Cited by

An Inflection Point in Cancer Protein Biomarkers: What was and What's Next.

Mol Cell Proteomics. 2023 Jul;22(7):100569. doi: 10.1016/j.mcpro.2023.100569. Epub 2023 May 16.

A resample-replace lasso procedure for combining high-dimensional markers with limit of detection.

J Appl Stat. 2021 Sep 22;49(16):4278-4293. doi: 10.1080/02664763.2021.1977785. eCollection 2022.

On Comprehensive Mass Spectrometry Data Analysis for Proteome Profiling of Human Blood Samples.

J Healthc Inform Res. 2018 May 22;2(3):305-318. doi: 10.1007/s41666-018-0022-0. eCollection 2018 Sep.

Nanoplasmonic immunosensor for the detection of SCG2, a candidate serum biomarker for the early diagnosis of neurodevelopmental disorder.

Sci Rep. 2021 Nov 23;11(1):22764. doi: 10.1038/s41598-021-02262-7.

Incorporating Machine Learning into Established Bioinformatics Frameworks.

Int J Mol Sci. 2021 Mar 12;22(6):2903. doi: 10.3390/ijms22062903.

Folded concave penalized learning of high-dimensional MRI data in Parkinson's disease.

J Neurosci Methods. 2021 Jun 1;357:109157. doi: 10.1016/j.jneumeth.2021.109157. Epub 2021 Mar 26.

Pilot proteomic analysis of cerebrospinal fluid in Alzheimer's disease.

Proteomics Clin Appl. 2021 May;15(2-3):e2000072. doi: 10.1002/prca.202000072. Epub 2021 Apr 26.

MALDI-TOF mass spectrometry on intact bacteria combined with a refined analysis framework allows accurate classification of MSSA and MRSA.

PLoS One. 2019 Jun 27;14(6):e0218951. doi: 10.1371/journal.pone.0218951. eCollection 2019.

The parameter sensitivity of random forests.

BMC Bioinformatics. 2016 Sep 1;17(1):331. doi: 10.1186/s12859-016-1228-x.

Folded concave penalized learning in identifying multimodal MRI marker for Parkinson's disease.

J Neurosci Methods. 2016 Aug 1;268:1-6. doi: 10.1016/j.jneumeth.2016.04.016. Epub 2016 Apr 19.
