Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA.
Illumina Inc., San Diego, CA 92122, USA.
Bioinformatics. 2020 Dec 30;36(Suppl_2):i745-i753. doi: 10.1093/bioinformatics/btaa807.
Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target-decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra.
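As a point of reference for the target-decoy idea described above, the following is a minimal sketch of the commonly used counting estimator, in which the number of decoy matches at or above a score threshold stands in for the number of incorrect target matches. The function name `tda_fdr` and the toy scores are hypothetical, and the sketch assumes a separate (not concatenated) target and decoy search of equal database size; it is not taken from the authors' code.

```python
def tda_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR at a threshold as (#decoys >= t) / (#targets >= t).

    Assumes higher scores are better and that the decoy database is the
    same size as the target database (an assumption of this sketch).
    """
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    if n_target == 0:
        # No accepted target PSMs, so there is nothing to be wrong about.
        return 0.0
    # The ratio can exceed 1 on small samples; cap it at 1.
    return min(1.0, n_decoy / n_target)


# Toy, made-up scores: 4 targets and 1 decoy pass the threshold of 3.
example_fdr = tda_fdr([10, 9, 8, 3, 2], [4, 1], threshold=3)
```

Because the estimate depends on a random decoy database, repeated runs with different decoys can give different values, which is one source of the instability the abstract attributes to TDAs.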
We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms.
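To make the mixture-based estimate concrete, here is a minimal sketch of how an FDR could be read off a fitted mixture of skew normal score distributions. It is deliberately simplified to two components (correct and incorrect), which is the conventional DFA setting rather than the multi-component model proposed here, and it takes the parameters as given instead of fitting them by EM. All function names and parameter values are hypothetical, and the tail integral uses a plain trapezoidal rule rather than any method from the authors' repository.

```python
import math


def skew_normal_pdf(x, loc, scale, shape):
    """Skew normal density: (2 / scale) * phi(z) * Phi(shape * z)."""
    z = (x - loc) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(shape * z / math.sqrt(2.0)))
    return 2.0 / scale * phi * big_phi


def tail_mass(t, loc, scale, shape, width=12.0, steps=4000):
    """Approximate P(score >= t) with the trapezoidal rule.

    The integral is truncated at loc + width * scale, beyond which the
    density is negligible for the default width.
    """
    upper = loc + width * scale
    if t >= upper:
        return 0.0
    h = (upper - t) / steps
    total = 0.5 * (skew_normal_pdf(t, loc, scale, shape)
                   + skew_normal_pdf(upper, loc, scale, shape))
    for i in range(1, steps):
        total += skew_normal_pdf(t + i * h, loc, scale, shape)
    return total * h


def mixture_fdr(t, weight_correct, correct, incorrect):
    """FDR above threshold t for a two-component mixture.

    correct and incorrect are (loc, scale, shape) tuples; weight_correct
    is the mixing proportion of correct PSMs.
    """
    mass_correct = weight_correct * tail_mass(t, *correct)
    mass_incorrect = (1.0 - weight_correct) * tail_mass(t, *incorrect)
    accepted = mass_correct + mass_incorrect
    return mass_incorrect / accepted if accepted > 0 else 0.0


# Hypothetical fitted parameters, for illustration only.
fdr_loose = mixture_fdr(0.0, 0.5, (3.0, 1.0, 0.0), (0.0, 1.0, 0.0))
fdr_strict = mixture_fdr(2.0, 0.5, (3.0, 1.0, 0.0), (0.0, 1.0, 0.0))
```

With well-separated components, raising the threshold lowers the estimated FDR, as `fdr_strict < fdr_loose` illustrates. The framework in the abstract additionally models the best and second-best incorrect PSM scores, which this sketch does not attempt.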
https://github.com/shawn-peng/FDR-estimation.
Supplementary data are available at Bioinformatics online.