Noy Karin, Fasulo Daniel
Integrated Data System Department, Siemens Corporate Research, 755 College Road East, Princeton, NJ 08540, USA.
Bioinformatics. 2007 Oct 1;23(19):2528-35. doi: 10.1093/bioinformatics/btm385. Epub 2007 Aug 13.
Mass spectrometry (MS) is increasingly used in biomedical research. A typical analysis of MS data consists of several steps, of which feature extraction is crucial because all subsequent analyses are performed only on the detected features. Current methods for low-resolution MS, in which features are peaks or wavelet functions, are sensitive to parameter choices and are inaccurate in the sense that peaks and wavelet functions do not correspond directly to the underlying molecules under observation. In high-resolution MS, the model-based approach is more appealing because it represents the MS signal better by incorporating information about peak shapes and isotopic distributions. Current model-based techniques are computationally expensive, and various algorithms have been proposed to improve their efficiency. However, these methods do not handle overlapping features well, especially when the features merge into one broad peak. In addition, no method has been shown to perform well across different MS platforms.
We propose a new model-based approach to feature extraction in which spectra are decomposed into a mixture of distributions derived from peptide models. By incorporating kernel-based smoothing and perceptual similarity for matching distributions, our statistical framework improves on existing methodologies in both computational efficiency and accuracy. Because the model is parameterized by physical properties, it is applicable to different MS instruments and settings. We validate the approach on simulated data and show that it outperforms commonly used tools on real high- and low-resolution MS and MS/MS data sets.
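To make the idea of decomposing a spectrum into a mixture of peptide-derived distributions concrete, the following is a minimal sketch and not the authors' algorithm. It assumes Gaussian peak shapes, a crude Poisson approximation to the isotopic distribution with an illustrative rate constant, and a generic EM-style multiplicative update for non-negative mixture weights. The kernel-based smoothing and perceptual-similarity matching mentioned in the abstract are not reproduced. The names `peptide_template` and `decompose` and all parameter values are hypothetical.

```python
import math

ISOTOPE_SPACING = 1.00335  # approximate 13C - 12C mass difference in Da


def peptide_template(mz_axis, mono_mz, charge, sigma=0.02, n_isotopes=6):
    """Unit-area model signal for one peptide on a fixed m/z axis.

    Assumptions (illustrative only): Gaussian peaks of width `sigma`,
    isotope abundances from a Poisson law whose mean grows linearly
    with mass, and the proton mass is neglected.
    """
    mass = mono_mz * charge
    lam = 0.0006 * mass  # hypothetical rate constant, not a fitted value
    abundances = [math.exp(-lam) * lam ** k / math.factorial(k)
                  for k in range(n_isotopes)]
    signal = []
    for mz in mz_axis:
        value = 0.0
        for k, abundance in enumerate(abundances):
            center = mono_mz + k * ISOTOPE_SPACING / charge
            value += abundance * math.exp(-0.5 * ((mz - center) / sigma) ** 2)
        signal.append(value)
    total = sum(signal)
    return [v / total for v in signal]  # normalize so the template sums to 1


def decompose(spectrum, templates, n_iter=500):
    """Estimate non-negative mixture weights with EM-style updates.

    Each template must sum to 1; the update then preserves the total
    intensity of the spectrum at every iteration.
    """
    n_points = len(spectrum)
    n_templates = len(templates)
    weights = [sum(spectrum) / n_templates] * n_templates
    for _ in range(n_iter):
        # Current model prediction at every m/z point.
        fitted = [sum(weights[k] * templates[k][i] for k in range(n_templates))
                  for i in range(n_points)]
        # Reweight each component by how much observed signal it explains.
        weights = [
            weights[k] * sum(templates[k][i] * spectrum[i] / fitted[i]
                             for i in range(n_points) if fitted[i] > 0)
            for k in range(n_templates)
        ]
    return weights


# Two doubly charged peptides only 0.03 m/z apart, so their peaks merge.
axis = [499.9 + 0.005 * i for i in range(640)]
templates = [peptide_template(axis, 500.00, 2),
             peptide_template(axis, 500.03, 2)]
observed = [100.0 * a + 60.0 * b for a, b in zip(templates[0], templates[1])]
weights = decompose(observed, templates)
print([round(w, 3) for w in weights])  # should be close to the true 100 and 60
```

In this noise-free toy case the two candidates are known in advance, so the sketch only recovers their intensities. A practical feature extractor would also have to propose candidate masses and charges and cope with noise, which is where the choices described in the abstract come in.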