Wang Haipeng, Fu Yan, Sun Ruixiang, He Simin, Zeng Rong, Gao Wen
Digital Technology Lab, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China.
Pac Symp Biocomput. 2006:303-14.
Tandem mass spectrometry (MS/MS) has become indispensable in high-throughput proteomics for identifying proteins in complex mixtures. Database searching is the standard method for this purpose. Its key subroutine, peptide identification, generates a list of candidate peptides from a protein database for an experimental MS/MS spectrum and then validates these candidates for protein identification. Although many peptide identification algorithms exist, most of them either lack an effective validation module or validate only the first-ranked peptide, which leads to low identification reliability or sensitivity. This paper proposes a new algorithm, named pepReap, to overcome these drawbacks. It consists of a two-layered scoring scheme based on machine learning. The first layer is a rough scoring function that uses simple heuristic factors to measure how well each candidate peptide matches the experimental MS/MS spectrum; it thus generates a ranked list of candidate peptides at a relatively low computational cost. The second layer is a fine scoring function that re-ranks the candidate peptides generated in the first layer and determines which one among them is the true positive. The fine scoring function is based on support vector machines (SVMs) and uses more comprehensive factors, such as the correlations between ions and the mass matching errors of fragment and peptide ions. Consequently, the SVM classifier serves not only as a scorer but also as a validation module. An experimental comparison on a previously reported dataset with the popular SEQUEST algorithm coupled with threshold validation criteria shows that pepReap achieves higher identification sensitivity with comparable precision.
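The two-layer scheme described in the abstract can be illustrated with a minimal sketch. This is not the pepReap implementation: the names (`Candidate`, `rough_score`, `fine_features`, `fine_score`, `identify`), the feature set, the tolerance, and the weights are all hypothetical. The fine layer is represented by a linear decision function of the form w·x + b, which is what a trained linear SVM evaluates; the weights below are made up for illustration rather than learned, and the ion-correlation features mentioned in the abstract are omitted. Treating a decision value above zero as "validated" is likewise an assumption.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Candidate:
    sequence: str
    fragment_mzs: list[float]  # theoretical fragment m/z values (assumed precomputed)
    precursor_mass: float      # theoretical peptide mass


def match_errors(peaks: list[float], fragment_mzs: list[float], tol: float) -> list[float]:
    """Return the m/z error of every theoretical fragment that has a peak within tol."""
    errors = []
    for mz in fragment_mzs:
        err = min(abs(p - mz) for p in peaks)
        if err <= tol:
            errors.append(err)
    return errors


def rough_score(peaks: list[float], cand: Candidate, tol: float = 0.5) -> float:
    """Layer 1: cheap heuristic, here the fraction of theoretical fragments matched."""
    if not cand.fragment_mzs:
        return 0.0
    return len(match_errors(peaks, cand.fragment_mzs, tol)) / len(cand.fragment_mzs)


def fine_features(peaks, precursor_mass, cand: Candidate, tol: float = 0.5) -> list[float]:
    """Layer 2 inputs: matched fraction, mean fragment error, precursor mass error."""
    errs = match_errors(peaks, cand.fragment_mzs, tol)
    mean_err = sum(errs) / len(errs) if errs else tol
    return [rough_score(peaks, cand, tol), mean_err, abs(precursor_mass - cand.precursor_mass)]


# Hypothetical weights and bias standing in for a trained linear SVM.
WEIGHTS = [4.0, -2.0, -1.0]
BIAS = -2.0


def fine_score(features: list[float]) -> float:
    """Linear decision value w.x + b; positive means 'accepted' in this sketch."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def identify(peaks, precursor_mass, candidates: list[Candidate], top_n: int = 10):
    """Shortlist by rough score, re-rank by fine score, validate the best candidate."""
    shortlist = sorted(candidates, key=lambda c: rough_score(peaks, c), reverse=True)[:top_n]
    scored = [(fine_score(fine_features(peaks, precursor_mass, c)), c) for c in shortlist]
    best_score, best = max(scored, key=lambda t: t[0])
    return best.sequence if best_score > 0 else None


# Toy usage with invented numbers.
peaks = [100.0, 200.1, 300.0, 400.2]
good = Candidate("PEPTIDEA", [100.0, 200.0, 300.0, 400.0], 500.1)
poor = Candidate("PEPTIDEB", [150.0, 250.0, 350.0, 400.0], 503.0)
result = identify(peaks, 500.0, [poor, good])
```

Because the second layer both re-ranks and thresholds, a spectrum whose best candidate still scores poorly yields no identification at all, which mirrors the abstract's point that the classifier acts as scorer and validation module at once.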