A refined method to calculate false discovery rates for peptide identification using decoy databases.

Author information

Navarro Pedro, Vázquez Jesús

Affiliation

Centro de Biología Molecular Severo Ochoa (CSIC), Universidad Autónoma de Madrid, 28049 Madrid, Spain.

Publication information

J Proteome Res. 2009 Apr;8(4):1792-6. doi: 10.1021/pr800362h.

DOI:10.1021/pr800362h

PMID:19714873

Abstract

Using decoy databases to estimate the number of false positive assignments is one of the most widely used methods to calculate false discovery rates in large-scale peptide identification studies. However, in spite of its widespread use, the decoy approach has not been fully standardized. Decoy databases may be searched separately from the target database or concatenated with it, the latter allowing a competition strategy; depending on the method used, two alternative formulations are possible to calculate error rates. Although both methods are conservative, the separate database approach overestimates the number of false positive assignments due to the presence of MS/MS spectra produced by true peptides, while the concatenated approach calculates the error rate in a population that is larger than that obtained after searching a target database. In this work, we demonstrate that by analyzing as a whole the joint distribution of matches obtained after performing a separate database search, and applying the competition strategy, it is possible to make a more accurate calculation of false discovery rates. We show that, for several kinds of scores, both the separate and the concatenated approaches clearly overestimate error rates with respect to those calculated by the new algorithm. We conclude that the new indicator provides a more sensitive alternative, and establishes for the first time a unique and integrated framework to calculate error rates in large-scale peptide identification studies.
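The abstract refers to two alternative formulations without stating them. As an illustration only, the Python sketch below uses the forms commonly cited for decoy searches, which are an assumption here and not taken from the abstract: D/T for separate searches, and 2D/(T+D) after per-spectrum competition in a concatenated search, where T and D are the numbers of target and decoy matches at or above a score threshold. The function names, the paired-score input, and the tie-breaking rule are hypothetical, and the refined indicator proposed by the authors is not reproduced.

```python
from typing import Sequence


def fdr_separate(target_scores: Sequence[float],
                 decoy_scores: Sequence[float],
                 threshold: float) -> float:
    """Separate-search estimate (assumed form): decoy hits / target hits."""
    t = sum(1 for s in target_scores if s >= threshold)
    d = sum(1 for s in decoy_scores if s >= threshold)
    return d / t if t else 0.0


def fdr_concatenated(target_scores: Sequence[float],
                     decoy_scores: Sequence[float],
                     threshold: float) -> float:
    """Competition estimate (assumed form): 2 * decoy wins / all wins.

    Position i in both sequences is assumed to hold the best target
    and best decoy score for the same spectrum.
    """
    t = d = 0
    for ts, ds in zip(target_scores, decoy_scores):
        # Competition: only the better of the two matches is kept.
        # Ties go to the target here, an arbitrary choice.
        if ds > ts:
            if ds >= threshold:
                d += 1
        elif ts >= threshold:
            t += 1
    total = t + d
    return 2 * d / total if total else 0.0


if __name__ == "__main__":
    # Toy scores for four spectra (made up for illustration).
    targets = [5.0, 4.0, 3.0, 1.0]
    decoys = [1.0, 2.0, 3.5, 0.5]
    print(fdr_separate(targets, decoys, threshold=2.5))
    print(fdr_concatenated(targets, decoys, threshold=2.5))
```

In this sketch the two estimates are computed on different populations: the separate form counts every target match above the threshold, while the competition form counts only the winners of each target/decoy pair, which is the distinction the abstract draws between the two approaches.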