School of Software, Dalian University of Technology, Dalian 116621, China.
Bioinformatics. 2014 Mar 1;30(5):675-81. doi: 10.1093/bioinformatics/btt431. Epub 2013 Aug 6.
Statistical validation of protein identifications is an important issue in shotgun proteomics. The false discovery rate (FDR) is a powerful statistical tool for evaluating protein identification results. Several methods have been proposed for estimating the FDR at the protein level. However, the existing FDR estimation methods based on the target-decoy strategy still have certain drawbacks.
In this article, we propose a decoy-free protein-level FDR estimation method. Under the null hypothesis that each candidate protein matches an identified peptide totally at random, we assign statistical significance to protein identifications in terms of the permutation P-value and use these P-values to calculate the FDR. Our method consists of three key steps: (i) generating random bipartite graphs with the same structure; (ii) calculating the protein scores on these random graphs; and (iii) calculating the permutation P-value and final FDR. As it is time-consuming or even prohibitive to execute the protein inference algorithm thousands of times in step (ii), we first train a linear regression model using the original bipartite graph and the identification scores provided by the target inference algorithm. We then use the learned regression model as a substitute for the original protein inference method to predict protein scores on the shuffled graphs. We test our method on six publicly available datasets. The results show that our method is comparable to state-of-the-art algorithms in terms of estimation accuracy.
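Step (iii) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `permutation_p_values` and `bh_q_values` are hypothetical, and so are the toy scores. The abstract does not say how P-values are converted into an FDR, so the Benjamini-Hochberg adjustment used here is an assumption. The sketch also assumes that higher protein scores indicate more confident identifications, and that the scores on the shuffled graphs have already been predicted (for example by the regression model from step (ii)).

```python
from typing import List, Sequence


def permutation_p_values(observed: Sequence[float],
                         null_scores: Sequence[Sequence[float]]) -> List[float]:
    """Permutation P-value per protein.

    null_scores[i] holds the scores protein i received on the shuffled
    graphs (assumed to be precomputed; higher score = more confident).
    """
    p_values = []
    for score, nulls in zip(observed, null_scores):
        # Count shuffled-graph scores at least as large as the observed one.
        exceed = sum(1 for s in nulls if s >= score)
        # The +1 correction keeps the estimate strictly positive.
        p_values.append((exceed + 1) / (len(nulls) + 1))
    return p_values


def bh_q_values(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted P-values, returned in input order.

    Assumption: the abstract does not name the FDR procedure; BH is
    used here only as a common choice.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so the q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Toy data (hypothetical): three proteins, four shuffled graphs each.
observed = [9.0, 5.0, 1.0]
null_scores = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 6.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
]
p = permutation_p_values(observed, null_scores)
q = bh_q_values(p)
```

With only four shuffled graphs the smallest attainable P-value is 1/5, which shows why thousands of random graphs are needed in practice and why a cheap surrogate for the inference algorithm is useful.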
The source code of our algorithm is available at: https://sourceforge.net/projects/plfdr/