Wang Guanghui, Wu Wells W, Zhang Zheng, Masilamani Shyama, Shen Rong-Fong
Proteomics Core Facility, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland 20892, USA.
Anal Chem. 2009 Jan 1;81(1):146-59. doi: 10.1021/ac801664q.
The potential for a significant number of false positives (FPs) among peptide-spectrum matches (PSMs) obtained by proteomic database searching is well recognized. Among the attempts to assess FPs, the concomitant use of target and decoy databases is widely practiced. By adjusting filtering criteria, FPs and the false discovery rate (FDR) can be controlled at a desired level. Although the target-decoy approach is gaining in popularity, implementations differ subtly in decoy construction (e.g., reversing vs stochastic methods), rate calculation (e.g., total vs unique PSMs), and searching (separate vs composite). In the present study, we evaluated the effects of these differences on FP and FDR estimation, using a rat kidney protein sample and the SEQUEST search engine as an example. Regarding decoy construction, we found that, when a single scoring filter (XCorr) was used, stochastic methods yielded higher estimates of FPs and FDR than sequence-reversing methods, likely because of an increase in unique peptides. This higher estimate could largely be attenuated by creating decoy databases of similar effective size, but not by a simple normalization with a unique-peptide coefficient. When multiple filters were applied, the differences between reversing and stochastic methods diminished significantly, suggesting that multiple filters reduce the dependence on how a decoy is constructed. For a fixed set of filtering criteria, FDR and FPs estimated from unique PSMs were almost twice those estimated from total PSMs. This higher estimate appeared to depend on the data acquisition setup. As for the difference between separate and composite searches, the FDR estimated from the separate search was, in general, about three times that from the composite search. The degree of difference gradually decreased as the filtering criteria became more stringent. Paradoxically, the estimated true positives in the separate search were higher when multiple filters were used. By analyzing a standard protein mixture, we demonstrated that the higher estimates of FDR and FPs in the separate search likely reflected an overestimation, which could be corrected with a simple merging procedure. Our study illustrates the relative merits of different implementations of the target-decoy strategy, which are worth considering when large-scale proteomic biomarker discovery is attempted.
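The rate-calculation choices named in the abstract (total vs unique PSMs, separate vs composite search) can be illustrated with a minimal Python sketch. It is not taken from the study: the formulas are commonly used target-decoy conventions assumed here for illustration (decoy/target for a separate search, 2 x decoy / (target + decoy) for a composite search), the `PSM` record and both function names are hypothetical, and the example PSMs are invented toy data.

```python
from typing import Iterable, NamedTuple, Tuple


class PSM(NamedTuple):
    """Hypothetical minimal record for one peptide-spectrum match."""
    peptide: str
    is_decoy: bool
    xcorr: float


def count_hits(psms: Iterable[PSM], min_xcorr: float,
               unique: bool = False) -> Tuple[int, int]:
    """Return (target, decoy) counts for PSMs passing a single XCorr filter.

    With unique=True, each distinct peptide sequence is counted once.
    """
    passing = [p for p in psms if p.xcorr >= min_xcorr]
    if unique:
        targets = {p.peptide for p in passing if not p.is_decoy}
        decoys = {p.peptide for p in passing if p.is_decoy}
        return len(targets), len(decoys)
    n_decoy = sum(1 for p in passing if p.is_decoy)
    return len(passing) - n_decoy, n_decoy


def estimate_fdr(n_target: int, n_decoy: int,
                 search: str = "composite") -> float:
    """Estimate FDR under one of two assumed conventions.

    separate:  FDR = decoy / target
    composite: FDR = 2 * decoy / (target + decoy)
    """
    if search == "separate":
        false_positives, total = n_decoy, n_target
    elif search == "composite":
        false_positives, total = 2 * n_decoy, n_target + n_decoy
    else:
        raise ValueError(f"unknown search mode: {search!r}")
    return false_positives / total if total else 0.0


# Invented toy data, for illustration only.
psms = [
    PSM("PEPTIDEK", False, 3.1),
    PSM("PEPTIDEK", False, 2.8),  # repeat identification of the same peptide
    PSM("ELVISK", False, 2.5),
    PSM("LIVESR", False, 1.2),    # fails the filter
    PSM("KEDITPEP", True, 2.6),   # decoy hit that passes the filter
    PSM("KSIVLE", True, 1.0),     # decoy hit that fails the filter
]

for unique in (False, True):
    n_target, n_decoy = count_hits(psms, min_xcorr=2.0, unique=unique)
    label = "unique" if unique else "total"
    print(label, n_target, n_decoy,
          round(estimate_fdr(n_target, n_decoy, "composite"), 3))
```

In this toy set, collapsing repeated identifications lowers the target count while the decoy count stays the same, so the unique-PSM estimate comes out higher than the total-PSM estimate. That matches the direction of the effect reported in the abstract, though not its magnitude.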