Forshed Jenny, Pernemalm Maria, Tan Chuen Seng, Lindberg Marita, Kanter Lena, Pawitan Yudi, Lewensohn Rolf, Stenke Leif, Lehtiö Janne
Clinical Proteomics, Karolinska Biomics Center, Karolinska University Hospital, Stockholm, Sweden.
J Proteome Res. 2008 Jun;7(6):2332-41. doi: 10.1021/pr070482e. Epub 2008 May 2.
Our goal in this paper is to present an analytical workflow for selecting protein biomarker candidates from SELDI-MS data. The clinical question at issue is how to predict the duration of complete remission (CR) for acute myeloid leukemia (AML) patients. Such a prediction would facilitate disease prognosis and make individualized therapy possible. SELDI-MS proteomics analyses were performed on blast cell samples collected from AML patients before chemotherapy. Although the biobank included approximately 200 samples, only 58 were available for analysis. The presented workflow includes sample selection, experimental optimization, repeatability estimation, data preprocessing, data fusion, and feature selection. Specific difficulties were the small number of samples and the skewed distribution of CR duration among the patients. Further, we had to deal with both noisy SELDI-MS data and a diverse patient cohort. These difficulties were handled by sample selection and by applying several methods for data preprocessing and feature detection in the analysis workflow. Four conceptually different methods for peak detection and alignment were considered, as well as two different methods for feature selection. The peak detection and alignment methods were the recently developed annotated regions of significance (ARS) method; the SELDI-MS software Ciphergen Express, which was regarded as the standard method; segment-wise spectral alignment by a genetic algorithm (PAGA) followed by binning; and, finally, binning of raw data. In the feature selection, the "standard" Mann-Whitney U test was compared with a hierarchical orthogonal partial least-squares (O-PLS) analysis approach. The combined information from all these analyses gave a collection of 21 protein peaks. These were regarded as the most promising and robust biomarker candidates, since they were selected as significant features in several of the models.
The chosen peaks will be our first choice for the continuing work on protein identification and biological validation. The identification will be performed by chromatographic purification and MALDI MS/MS. Thus, we have shown that the use of several data handling methods can improve a protein profiling workflow from experimental optimization to a predictive model. The framework of this methodology should be seen as general and could be applied to one-dimensional spectral omics data other than SELDI-MS, provided that an adequate number of samples is available.
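As an illustration of the simplest branch of the workflow described above (binning of raw data followed by a Mann-Whitney test per feature), a minimal Python sketch might look as follows. It is not the authors' implementation: the function names, the bin width, the split into two patient groups (for example short versus long CR duration), and the significance threshold are all assumptions made for the example. The p-value uses the normal approximation with tie correction and no continuity correction, and no multiple-testing correction is applied.

```python
import math
from typing import Dict, List, Sequence, Tuple


def bin_spectrum(mz: Sequence[float], intensity: Sequence[float],
                 mz_min: float, mz_max: float, bin_width: float) -> List[float]:
    """Sum raw intensities into fixed-width m/z bins (hypothetical helper)."""
    n_bins = int(math.ceil((mz_max - mz_min) / bin_width))
    bins = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        if mz_min <= m < mz_max:
            idx = min(int((m - mz_min) // bin_width), n_bins - 1)
            bins[idx] += i
    return bins


def rank_with_ties(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic of x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = rank_with_ties(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    n = n1 + n2
    counts: Dict[float, int] = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance <= 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(variance)
    return u1, math.erfc(abs(z) / math.sqrt(2))


def select_features(group_a: Sequence[Sequence[float]],
                    group_b: Sequence[Sequence[float]],
                    alpha: float = 0.05) -> List[int]:
    """Return indices of bins whose intensities differ between two groups."""
    n_features = len(group_a[0])
    selected = []
    for f in range(n_features):
        _, p = mann_whitney_u([s[f] for s in group_a], [s[f] for s in group_b])
        if p < alpha:
            selected.append(f)
    return selected
```

In a multi-method workflow of the kind described above, the bins returned by `select_features` would be only one of several candidate lists; the abstract's approach is to keep peaks that recur as significant across the different preprocessing and feature selection methods.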