Pratapa Pallavi N, Patz Edward F, Hartemink Alexander J
Duke University, Dept. of Computer Science, Box 90129, Durham, NC 27708, USA.
Pac Symp Biocomput. 2006:279-90.
In seeking to find diagnostic biomarkers in proteomic spectra, two significant problems arise. First, not only is there noise in the measured intensity at each m/z value, but there is also noise in the measured m/z value itself. Second, the potential for overfitting is severe: it is easy to find features in the spectra that accurately discriminate disease states but have no biological meaning. We address these problems by developing and testing a series of steps for pre-processing proteomic spectra and extracting putatively meaningful features before presentation to feature selection and classification algorithms. These steps include an HMM-based latent spectrum extraction algorithm for fusing the information from multiple replicate spectra obtained from a single tissue sample, a simple algorithm for baseline correction based on a segmented convex hull, a peak identification and quantification algorithm, and a peak registration algorithm to align peaks from multiple tissue samples into common peak registers. We apply these steps to MALDI spectral data collected from normal and tumor lung tissue samples, and then compare the performance of feature selection with FDR followed by classification with an SVM, versus joint feature selection and classification with Bayesian sparse multinomial logistic regression (SMLR). The SMLR approach outperformed FDR+SVM, but both were effective in achieving good diagnostic accuracy with a small number of features. Some of the selected features have previously been investigated as clinical markers for lung cancer diagnosis; some of the remaining features are excellent candidates for further research.
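As an illustration of the baseline-correction step, the following is a minimal sketch of one way a segmented convex-hull baseline could be computed. It is not the authors' implementation: the abstract gives no algorithmic detail, so the function names (`lower_hull_baseline`, `segmented_baseline_correct`), the equal-sized segmentation, and the `n_segments` parameter are all assumptions made here for illustration. The sketch assumes the m/z values are sorted in strictly increasing order.

```python
import numpy as np


def lower_hull_baseline(mz, intensity):
    """Estimate a baseline as the lower convex hull of the points
    (mz, intensity), linearly interpolated at every m/z value.

    Assumes mz is strictly increasing. Hypothetical helper, not from the paper.
    """
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    hull = []  # indices of lower-hull vertices, left to right
    for i in range(len(mz)):
        # Drop the last vertex while it lies on or above the chord
        # joining the vertex before it to the new point i.
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = ((mz[b] - mz[a]) * (intensity[i] - intensity[a])
                     - (intensity[b] - intensity[a]) * (mz[i] - mz[a]))
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(mz, mz[hull], intensity[hull])


def segmented_baseline_correct(mz, intensity, n_segments=4):
    """Subtract a lower-convex-hull baseline computed independently on
    n_segments contiguous, roughly equal-sized pieces of the spectrum.

    The segmentation scheme is an assumption made for this sketch.
    """
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    baseline = np.empty_like(intensity)
    for idx in np.array_split(np.arange(len(mz)), n_segments):
        if len(idx) == 0:
            continue  # more segments than points: nothing to do here
        baseline[idx] = lower_hull_baseline(mz[idx], intensity[idx])
    return intensity - baseline


if __name__ == "__main__":
    # Synthetic spectrum: decaying drift plus one Gaussian peak.
    mz = np.linspace(1000.0, 2000.0, 201)
    drift = 50.0 * np.exp(-(mz - 1000.0) / 300.0)
    peak = 100.0 * np.exp(-0.5 * ((mz - 1375.0) / 5.0) ** 2)
    corrected = segmented_baseline_correct(mz, drift + peak)
    print("peak location after correction:", mz[np.argmax(corrected)])
```

Splitting into segments lets the baseline follow a drift that is not globally convex, at the cost of one caveat in this particular sketch: the first and last points of every segment are always hull vertices, so a peak that falls on a segment boundary would be flattened. A practical version would need to choose boundaries away from peaks.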