Instance based algorithm for posterior probability calculation by target-decoy strategy to improve protein identifications.

Author information

Jiang Xinning, Dong Xiaoli, Ye Mingliang, Zou Hanfa

Affiliation

National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China.

Publication information

Anal Chem. 2008 Dec 1;80(23):9326-35. doi: 10.1021/ac8017229.

PMID:19551949

Abstract

The target-decoy database search strategy is often applied to determine the global false-discovery rate (FDR) of peptide identifications in proteome research. However, the confidence of individual peptide identification is typically not determined. In this study, we introduced an approach for the calculation of posterior probability of individual peptide identification from the "local false-discovery rate" (local FDR), which is also determined based on a target-decoy database search. The peptide identification scores output by the database search algorithm were weighted by their discriminating power using a Shannon information entropy based strategy. Then the local FDR of a peptide identification was calculated based on the fraction of decoy identifications among its nearest neighbors within a small space defined by these weighted scores. It was demonstrated that the calculated probability matched the actual probability precisely, and it provided powerful discriminating performance between true positive and false positive identifications. Hence, the sensitivity for peptide identification as well as protein identification was significantly improved when the calculated probability was used to process different proteome data sets. As an instance based strategy, this algorithm provides a safe way for the posterior probability calculation and should work well for data sets with different characteristics.
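The two steps described in the abstract (entropy-based score weighting, then a nearest-neighbor local FDR) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made here for illustration, since the abstract does not specify them: the function names, the quantile binning used to estimate entropy, the use of normalized information gain as the weight, z-score scaling, Euclidean distance, the neighborhood size `k`, and estimating the local FDR as the decoy-to-target ratio among the neighbors (capped at 1).

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in bits) of a binary label with P(decoy) = p."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))


def entropy_weights(scores, is_decoy, n_bins=10):
    """Weight each score by its normalized information gain about the
    target/decoy label (an assumed stand-in for the paper's weighting)."""
    h_label = entropy(is_decoy.mean())
    weights = np.zeros(scores.shape[1])
    for j in range(scores.shape[1]):
        # Quantile bin edges, so every bin holds roughly the same number of PSMs.
        edges = np.quantile(scores[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        bins = np.digitize(scores[:, j], edges)
        # Conditional entropy of the label given the binned score.
        h_cond = 0.0
        for b in np.unique(bins):
            mask = bins == b
            h_cond += mask.mean() * entropy(is_decoy[mask].mean())
        weights[j] = max(h_label - h_cond, 0.0) / h_label
    return weights


def posterior_probabilities(scores, is_decoy, k=50, n_bins=10):
    """Return 1 - local FDR for every identification.

    The local FDR is estimated from the k nearest neighbors (the point
    itself included) in the weighted score space. Values are only
    meaningful for target identifications.
    """
    scores = np.asarray(scores, dtype=float)
    is_decoy = np.asarray(is_decoy, dtype=bool)

    # Put all scores on a comparable scale before weighting.
    std = scores.std(axis=0)
    std[std == 0] = 1.0
    z = (scores - scores.mean(axis=0)) / std
    x = z * entropy_weights(scores, is_decoy, n_bins)

    # Brute-force pairwise squared distances: O(n^2) memory, fine for a sketch.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    neighbors = np.argsort(d2, axis=1)[:, :k]

    n_decoy = is_decoy[neighbors].sum(axis=1)
    n_target = k - n_decoy
    # Assumption: each decoy hit stands for one expected false target hit.
    local_fdr = np.minimum(n_decoy / np.maximum(n_target, 1), 1.0)
    return 1.0 - local_fdr
```

Here `scores` is expected to be an array with one row per identification and one column per search-engine score, and `is_decoy` a boolean array marking decoy hits. For real data sets a spatial index would be preferable to the brute-force distance matrix.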
