Deriving the probabilities of water loss and ammonia loss for amino acids from tandem mass spectra.

Author information

Sun Shiwei, Yu Chungong, Qiao Yantao, Lin Yu, Dong Gongjin, Liu Changning, Zhang Jingfen, Zhang Zhuo, Cai Jinjin, Zhang Hong, Bu Dongbo

Affiliation

Bioinformatics Group, Center for Advanced Computing Research, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China.

Publication information

J Proteome Res. 2008 Jan;7(1):202-8. doi: 10.1021/pr070479v. Epub 2007 Dec 20.

DOI:10.1021/pr070479v

PMID:18092745

Abstract

In protein identification through tandem mass spectrometry, it is critical to accurately predict the theoretical spectrum for a peptide sequence. Widely used prediction models, such as those in SEQUEST and MASCOT, ignore the intensities of ions that have undergone important neutral losses, including water loss and ammonia loss. Ignoring these neutral losses, however, results in a significant deviation between the predicted theoretical spectrum and its experimental counterpart. Here, based on the "one peak, multiple explanations" observation, we proposed an expectation-maximization (EM) method to automatically learn the probabilities of water loss and ammonia loss for each amino acid. We then employed these probabilities to design an improved statistical model for theoretical spectrum prediction. We implemented these methods and tested them on practical data. On a training set containing 1803 spectra, the learned probabilities agree well with existing knowledge about neutral losses, such as the tendency of Asp, Glu, Ser, and Thr to lose water. Furthermore, on a testing set containing 941 spectra, the improved similarity between the experimental and predicted spectra demonstrates that this method generates more reasonable predictions than a model that ignores neutral losses. As an application of the derived probabilities, we implemented a database searching method that adopts the improved theoretical spectrum model, including the estimated neutral-loss ions. Experimental results on Keller's data set demonstrate that this method can identify peptides more accurately than SEQUEST. In another application, the validation of SEQUEST's results, the reported peptide-spectrum pairs are reranked by the similarity between the experimental and predicted spectra. Experimental results on both LTQ and QSTAR data sets suggest that this reranking strategy can effectively distinguish the false negative predictions reported by SEQUEST.
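The abstract does not spell out the EM model, so the following is only a toy sketch of how an EM procedure can resolve "one peak, multiple explanations", not the authors' method. It assumes a simplified noisy-OR setting: each fragment is a string of residues, each residue independently loses water with a probability that depends only on its amino acid, and we observe only whether a water-loss peak was matched for the fragment. When a loss peak is seen, any residue in the fragment could explain it, which is the ambiguity EM resolves. The function name `em_neutral_loss`, the input format, and the data are all hypothetical.

```python
from collections import defaultdict


def em_neutral_loss(fragments, n_iter=200, init=0.3):
    """Toy EM for per-amino-acid loss probabilities (noisy-OR assumption).

    fragments: list of (residues, observed) pairs, where `residues` is a
    string of one-letter amino acid codes and `observed` is True if a
    neutral-loss peak was matched for that fragment.
    """
    acids = sorted({a for residues, _ in fragments for a in residues})
    p = {a: init for a in acids}

    for _ in range(n_iter):
        expected = defaultdict(float)  # expected number of loss events
        total = defaultdict(int)       # number of residue occurrences

        for residues, observed in fragments:
            for a in residues:
                total[a] += 1
            if not observed:
                continue  # no peak: no residue lost the group

            # Probability that at least one residue lost the group.
            none_lost = 1.0
            for a in residues:
                none_lost *= 1.0 - p[a]
            p_any = 1.0 - none_lost
            if p_any <= 0.0:
                continue

            # E-step: P(residue lost | a loss peak was observed).
            for a in residues:
                expected[a] += p[a] / p_any

        # M-step: expected loss events per occurrence of each amino acid.
        p = {a: expected[a] / total[a] for a in acids}

    return p


if __name__ == "__main__":
    # Made-up observations, for illustration only.
    toy = [("S", True), ("S", True), ("S", False),
           ("G", False), ("G", False),
           ("SG", True), ("SG", False)]
    for acid, prob in em_neutral_loss(toy).items():
        print(acid, round(prob, 3))
```

On this made-up input, the ambiguous "SG" peak is attributed almost entirely to S, because G never shows a loss on its own; the estimate for S settles near 0.6 and the estimate for G decays toward 0. A realistic model would also need to handle peak intensities, ion types, and ammonia loss, none of which this sketch attempts.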
