From the ‡Proteomics Center of Excellence, Northwestern University, Evanston, Illinois.
Mol Cell Proteomics. 2019 Apr;18(4):796-805. doi: 10.1074/mcp.RA118.000993. Epub 2019 Jan 15.
Within the last several years, top-down proteomics has emerged as a high-throughput technique for protein and proteoform identification. This technique has the potential to identify and characterize thousands of proteoforms within a single study, but the absence of accurate false discovery rate (FDR) estimation could hinder the adoption and consistency of top-down proteomics in the future. In automated identification and characterization of proteoforms, FDR calculation strongly depends on the context of the search. The context includes MS data quality, the database being interrogated, the search engine, and the parameters of the search. Particular to top-down proteomics, there are four molecular levels of study: proteoform spectral match (PrSM), protein, isoform, and proteoform. Here, a context-dependent framework for calculating an accurate FDR at each level was designed, implemented, and validated against a manually curated training set with 546 confirmed proteoforms. We examined several search contexts and found that an FDR calculated at the PrSM level under-reported the true FDR at the protein level by an average of 24-fold. We present a new open-source tool, the TDCD_FDR_Calculator, which provides a scalable, context-dependent FDR calculation that can be applied post-search to enhance the quality of results in top-down proteomics from any search engine.
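To illustrate why an FDR computed at the PrSM level can under-report the FDR at the protein level, the following minimal sketch compares the two on toy data. It assumes a plain target-decoy estimate (decoy count divided by target count), which is a common convention and not necessarily the estimator used by the TDCD_FDR_Calculator. The `PrSM` type, the accession names, and all counts are hypothetical and invented for illustration only.

```python
from typing import List, NamedTuple


class PrSM(NamedTuple):
    """One proteoform spectral match (hypothetical, simplified record)."""
    accession: str   # protein accession the spectrum matched (made-up names)
    is_decoy: bool   # True if the match is to a decoy sequence


def decoy_target_ratio(n_decoy: int, n_target: int) -> float:
    """Simple target-decoy FDR estimate: decoys / targets (assumed estimator)."""
    return n_decoy / n_target if n_target else 0.0


def prsm_level_fdr(prsms: List[PrSM]) -> float:
    """Count every spectral match individually."""
    n_decoy = sum(1 for p in prsms if p.is_decoy)
    n_target = len(prsms) - n_decoy
    return decoy_target_ratio(n_decoy, n_target)


def protein_level_fdr(prsms: List[PrSM]) -> float:
    """Collapse matches to unique protein accessions before counting."""
    targets = {p.accession for p in prsms if not p.is_decoy}
    decoys = {p.accession for p in prsms if p.is_decoy}
    return decoy_target_ratio(len(decoys), len(targets))


# Toy data: 5 target proteins with 20 PrSMs each (100 target PrSMs),
# plus 2 decoy PrSMs that fall on 2 different decoy proteins.
prsms = [PrSM(f"P{i}", False) for i in range(5) for _ in range(20)]
prsms += [PrSM("DECOY_A", True), PrSM("DECOY_B", True)]

print(f"PrSM-level FDR:    {prsm_level_fdr(prsms):.2f}")     # 2 / 100
print(f"Protein-level FDR: {protein_level_fdr(prsms):.2f}")  # 2 / 5
```

In this toy case, correct matches pile up on a few proteins while the decoy matches spread across distinct decoy entries, so collapsing to the protein level shrinks the target count far more than the decoy count. The size of the gap here is an artifact of the invented numbers and says nothing about the 24-fold average reported above.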