

A hypergeometric probability model for protein identification and validation using tandem mass spectral data and protein sequence databases.

Author information

Sadygov Rovshan G, Yates John R

Affiliation

Department of Cell Biology, SR11, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, USA.

Publication information

Anal Chem. 2003 Aug 1;75(15):3792-8. doi: 10.1021/ac034157w.

Abstract

We present a new probability-based method for protein identification using tandem mass spectra and protein databases. The method employs a hypergeometric distribution to model the frequencies of matches between an experimental tandem mass spectrum and the fragment ions predicted for peptide sequences in a protein sequence database that have a specific (M + H)⁺ value (within some mass tolerance). The hypergeometric distribution constitutes the null hypothesis that all peptide matches to a tandem mass spectrum are random. It is used to generate a score characterizing the randomness of a database sequence match to an experimental tandem mass spectrum and to determine the level of significance of the null hypothesis. For each tandem mass spectrum and database search, the peptide with the lowest probability of being a random match to the spectrum is identified, and the corresponding level of significance of the null hypothesis is determined. To check the validity of the hypergeometric model in describing fragment ion matches, we used a χ² test. The distribution of frequencies and the corresponding hypergeometric probabilities are generated for each tandem mass spectrum. No proteolytic cleavage specificity is used to create the peptide sequences from the database. We do not use any empirical probabilities in this method. The scores generated by the hypergeometric model do not have a significant molecular weight bias and are reasonably independent of database size. The approach has been implemented in a database search algorithm, PEP_PROBE. Using a large set of tandem mass spectra derived from peptides created by digesting a collection of known proteins with four different proteases, a false positive rate of 5% is demonstrated.
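The scoring idea in the abstract can be illustrated with a minimal sketch. This is not PEP_PROBE itself. The parameterization is an assumption made for illustration: `N` is the total number of fragment ions predicted for all candidate peptides within the precursor mass tolerance, `K` is how many of those fall on a peak in the experimental spectrum, `n` is the number of fragment ions predicted for one candidate, and `k` is how many of that candidate's ions match. The function names and the use of a tail probability as the score are likewise hypothetical.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Probability of exactly k matches among n predicted ions, given that
    K of the N ions in the whole candidate pool match the spectrum."""
    # Outside the support of the distribution the probability is zero.
    if k < max(0, n - (N - K)) or k > min(K, n):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_tail(k, N, K, n):
    """Probability of k or more matches under the random-match null hypothesis."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))


def best_candidate(candidates, N, K):
    """Return (probability, peptide) for the candidate least likely to be random.

    candidates: list of (peptide, n_predicted_ions, k_matched_ions) tuples.
    """
    scored = [(hypergeom_tail(k, N, K, n), pep) for pep, n, k in candidates]
    return min(scored)


if __name__ == "__main__":
    # Toy numbers invented for illustration; they are not data from the paper.
    pool_size, pool_matches = 1000, 100
    toy = [("PEPTIDEA", 20, 3), ("PEPTIDEB", 20, 12)]
    p, pep = best_candidate(toy, pool_size, pool_matches)
    print(pep, p)
```

A small tail probability means the observed number of matches is unlikely under the null hypothesis, so the candidate with the smallest value is reported as the identification.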

