Shilov Ignat V, Seymour Sean L, Patel Alpesh A, Loboda Alex, Tang Wilfred H, Keating Sean P, Hunter Christie L, Nuwaysir Lydia M, Schaeffer Daniel A
Applied Biosystems/MDS Sciex, Foster City, CA 94404, USA.
Mol Cell Proteomics. 2007 Sep;6(9):1638-55. doi: 10.1074/mcp.T600050-MCP200. Epub 2007 May 27.
The Paragon Algorithm, a novel database search engine for the identification of peptides from tandem mass spectrometry data, is presented. Sequence Temperature Values are computed using a sequence tag algorithm, allowing the degree to which an MS/MS spectrum implicates each region of a database to be determined on a continuum. Counter to conventional approaches, features such as modifications, substitutions, and cleavage events are modeled with probabilities rather than by discrete user-controlled settings that either consider or ignore a feature. The use of feature probabilities in conjunction with Sequence Temperature Values allows for a very large increase in the effective search space with only a very small increase in the actual number of hypotheses that must be scored. The algorithm has a new kind of user interface that removes the user expertise requirement, presenting control settings in the language of the laboratory that are translated to optimal algorithmic settings. To validate this new algorithm, a comparison with Mascot is presented for a series of analogous searches that explore the relative impact of increasing search space. With Mascot, the search space was enlarged by relaxing the tryptic digestion conformance requirements from trypsin to semitrypsin to no enzyme; with the Paragon Algorithm, it was enlarged by using the Rapid mode and the Thorough mode, each with and without tryptic specificity. Although the two engines performed similarly in small search space, dramatic differences were observed in large search space. With the Paragon Algorithm, hundreds of biological and artifact modifications, all possible substitutions, and all levels of conformance to the expected digestion pattern can be searched in a single search step, yet the typical cost in search time is only 2-5 times that of a conventional search of small search space. Despite this large increase in effective search space, the drastic loss of discrimination that typically accompanies the exploration of large search space does not occur.
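The abstract's central idea, that feature probabilities combined with Sequence Temperature Values enlarge the effective search space without a matching growth in scored hypotheses, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `Feature` class, the `probability_threshold` rule, the temperature scale from 0 to 1, the `floor` parameter, and all numeric probabilities are hypothetical. The sketch only shows the general principle that a strongly implicated ("hot") database region might justify scoring improbable features, while a weakly implicated ("cold") region might be limited to highly probable ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A hypothetical search feature with an assumed prior probability."""
    name: str
    probability: float  # illustrative value in (0, 1], not from the paper


def probability_threshold(temperature: float, floor: float = 0.001) -> float:
    """Map a temperature in [0, 1] to a minimum feature probability.

    Hypothetical rule: a cold region (0.0) gets a threshold of 1.0, a hot
    region (1.0) gets the floor, with log-linear interpolation in between.
    """
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must lie in [0, 1]")
    return floor ** temperature


def features_to_score(temperature: float, features: list[Feature],
                      floor: float = 0.001) -> list[Feature]:
    """Return the features probable enough to score at this temperature."""
    cutoff = probability_threshold(temperature, floor)
    return [f for f in features if f.probability >= cutoff]


# Made-up priors, chosen only to show the behaviour of the rule.
FEATURES = [
    Feature("carbamidomethylation (C)", 0.9),
    Feature("oxidation (M)", 0.1),
    Feature("deamidation (N)", 0.05),
    Feature("semitryptic cleavage", 0.02),
    Feature("amino acid substitution", 0.002),
]

for temperature in (0.0, 0.5, 1.0):
    selected = features_to_score(temperature, FEATURES)
    print(temperature, [f.name for f in selected])
```

Under this rule every feature stays reachable in principle, so the effective search space is large, but the rarest features are scored only in the few regions the spectrum implicates most strongly. The real algorithm's temperature computation and probability model are not described in the abstract and are not reproduced here.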