Department of Data Analysis and Artificial Intelligence, Faculty of Computer Science, National Research University Higher School of Economics (HSE), 11 Pokrovsky Boulevard, Moscow 109028, Russian Federation.
J Proteome Res. 2020 Apr 3;19(4):1481-1490. doi: 10.1021/acs.jproteome.9b00736. Epub 2020 Mar 25.
Peptide-spectrum-match (PSM) scores used in database searching are calibrated to spectrum- or spectrum-peptide-specific null distributions. Some calibration methods rely on specific assumptions and use analytical models (e.g., binomial distributions), whereas other methods utilize exact empirical null distributions. The former may be inaccurate because of unjustified assumptions, while the latter are accurate but computationally expensive. Here, we introduce a novel, nonparametric, heuristic PSM score calibration method, called Tailor, which calibrates PSM scores by dividing them by the top 100-quantile of the empirical, spectrum-specific null distribution (i.e., the score with an associated p-value of 0.01 at the tail, hence the name) observed during database searching. Tailor does not require any optimization steps or long calculations, and it does not rely on any assumptions about the form of the score distribution (e.g., whether it is binomial); however, it relies on our empirical observation that the mean and the variance of the null distributions are correlated. In our benchmark, we re-calibrated the XCorr match scores from Crux, the HyperScore scores from X!Tandem, and the p-values from OMSSA with the Tailor method and obtained more spectrum annotations than with raw scores at any false discovery rate level. Moreover, Tailor provided slightly more annotations than the E-values of X!Tandem and OMSSA. On spectrum data sets containing low-resolution fragmentation information (MS2), it approached the performance of the computationally expensive exact p-value method for XCorr while running around 20-150 times faster. On high-resolution MS2 data sets, the Tailor method with XCorr achieved state-of-the-art performance and produced more annotations than the well-calibrated residue-evidence (Res-ev) score while running around 50-80 times faster.
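The calibration step described above can be illustrated with a minimal sketch. It assumes that the scores of all candidate peptides matched to one spectrum serve as that spectrum's empirical null distribution, and that the "top 100-quantile" is the score ranked at the top 1% of those candidates. The function names `tail_score` and `tailor_calibrate` and the parameter `quantile_count` are hypothetical and are not taken from Crux or any other search engine.

```python
def tail_score(null_scores, quantile_count=100):
    """Return the score at the top 1/quantile_count of an empirical null.

    Assumption: `null_scores` holds the scores of every candidate peptide
    evaluated against a single spectrum during the database search.
    """
    ordered = sorted(null_scores, reverse=True)
    if not ordered:
        raise ValueError("at least one null score is required")
    # 1-based rank of the score whose empirical tail p-value is about
    # 1/quantile_count (0.01 by default); never lower than rank 1.
    rank = max(1, len(ordered) // quantile_count)
    return ordered[rank - 1]


def tailor_calibrate(raw_score, null_scores, quantile_count=100):
    """Divide a raw PSM score by the tail score of its spectrum's null."""
    tail = tail_score(null_scores, quantile_count)
    if tail <= 0:
        # Assumption of this sketch: non-positive tail scores are rejected
        # rather than handled, since division would flip or break the scale.
        raise ValueError("tail score must be positive for calibration")
    return raw_score / tail


# Toy example: 200 candidate scores 1..200, so the top 1% score is rank 2.
null = list(range(1, 201))
calibrated = tailor_calibrate(398.0, null)
```

In this toy case the score at rank 2 of 200 is 199, so a raw score of 398 is rescaled to 2.0. Because the divisor is computed per spectrum, the calibrated scores of different spectra are placed on a more comparable scale without fitting any parametric model.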