The Dental Center of China-Japan Friendship Hospital, Beijing, China.
ShenZhen Research Institute of Big Data, ShenZhen, China.
BMC Genomics. 2018 Aug 13;19(Suppl 6):567. doi: 10.1186/s12864-018-4923-3.
In mass spectrometry-based proteomics, protein identification is an essential task, and evaluating the statistical significance of the identification results is critical to the success of proteomics studies. Controlling the false discovery rate (FDR) is the most common way to assure the overall quality of a set of identifications. Existing FDR estimation methods either rely on specific assumptions or use a two-stage calculation that first estimates error rates at the peptide level and then combines them at the protein level. We propose to estimate the FDR non-parametrically, with fewer assumptions and without the two-stage calculation.
We propose a new protein-level FDR estimation framework with two major components: the Permutation+BH (Benjamini-Hochberg) FDR estimation method and a logistic regression-based null inference method. In Permutation+BH, the null distribution of a sample is generated by searching the data against a large number of permuted random protein databases, so it does not rely on specific assumptions. The p-values of proteins are then calculated from this null distribution, and the BH procedure is applied to the p-values to obtain the relationship between the FDR and the number of protein identifications. Because generating the null distribution by permutation is computationally expensive, Permutation+BH is inefficient for online identification. The logistic regression model is therefore proposed to infer the null distribution of a new sample from existing null distributions obtained with the Permutation+BH method.
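The p-value and BH steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each protein has a single score where higher is better, that the null scores from all permuted-database searches are pooled into one list, and that the empirical p-value uses the common add-one correction. The function names are hypothetical.

```python
from bisect import bisect_left


def empirical_p_values(observed_scores, null_scores):
    """Empirical p-value of each observed score against pooled null scores.

    Assumption: higher score = better match; the add-one correction
    (1 + count) / (1 + N) keeps every p-value strictly positive.
    """
    null_sorted = sorted(null_scores)
    n = len(null_sorted)
    p_values = []
    for s in observed_scores:
        # Number of null scores greater than or equal to s.
        n_ge = n - bisect_left(null_sorted, s)
        p_values.append((1 + n_ge) / (1 + n))
    return p_values


def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def identifications_at_fdr(adjusted, threshold):
    """Number of proteins accepted at the given FDR threshold."""
    return sum(1 for q in adjusted if q <= threshold)
```

Evaluating `identifications_at_fdr` over a range of thresholds traces out the relationship between the FDR and the number of protein identifications that the text refers to.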
In experiments on three publicly available datasets, our Permutation+BH method consistently outperforms MAYU, which was chosen as the benchmark FDR calculation method for this study. The null distribution inference results show that the logistic regression model gives reasonable results both for the shape of the null distribution and for the corresponding FDR estimates.