Higdon Roger, Hogan Jason M, Kolker Natali, van Belle Gerald, Kolker Eugene
Seattle Children's Hospital and Regional Medical Center, Seattle, WA 98101, USA.
OMICS. 2007 Winter;11(4):351-65. doi: 10.1089/omi.2007.0040.
Determining the error rate for peptide and protein identification accurately and reliably is necessary for the evaluation and cross-comparison of high-throughput proteomics experiments. Currently, peptide identification is based either on preset scoring thresholds or on probabilistic models trained on datasets that are often dissimilar to the experimental results. The false discovery rates (FDR) and peptide identification probabilities for these preset thresholds or models often vary greatly across the experimental treatments, organisms, or instruments used in specific experiments. To overcome these difficulties, randomized databases have been used to estimate the FDR. However, the cumulative FDR may include low-probability identifications when there are a large number of peptide identifications and exclude high-probability identifications when there are few. To overcome this logical inconsistency, this study expands the use of randomized databases to generate experiment-specific estimates of peptide identification probabilities. These experiment-specific probabilities are generated by logistic and Loess regression models of the peptide scores obtained from original and reshuffled database matches. They are shown to closely approximate "true" probabilities based on known standard protein mixtures across different experiments. Probabilities generated by the earlier Peptide_Prophet and the more recent LIPS models are shown to differ significantly from this study's experiment-specific probabilities, especially for unknown samples. The experiment-specific probabilities reliably estimate the accuracy of peptide identifications and overcome the potential logical inconsistencies of the cumulative FDR. This estimation method is demonstrated using a Sequest database search, the LIPS model, and a reshuffled database; however, the approach is generally applicable to any search algorithm, peptide scoring scheme, and statistical model when a randomized database is used.
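The contrast between a cumulative FDR and a per-identification probability can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the toy scores, and the conversion formula are assumptions made for illustration. It assumes each spectrum contributes one top hit from the original database and one from the reshuffled database, that reshuffled hits are distributed like incorrect original hits, and (conservatively) that each reshuffled hit stands in for one incorrect original hit. Only the logistic model is sketched; the Loess variant mentioned in the abstract is omitted.

```python
import math

def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow.
    z = max(-700.0, min(700.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def cumulative_fdr(target_scores, decoy_scores, threshold):
    """Decoy-based cumulative FDR estimate for all hits scoring >= threshold."""
    n_target = sum(s >= threshold for s in target_scores)
    n_decoy = sum(s >= threshold for s in decoy_scores)
    return min(1.0, n_decoy / n_target) if n_target else 0.0

def fit_logistic(scores, labels, iterations=25, ridge=1e-6):
    """Fit P(original-database hit | score) = sigmoid(a + b*score).

    Uses Newton's method with a tiny ridge penalty for numerical stability.
    Labels: 1 = original database match, 0 = reshuffled database match.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a, g_b = -ridge * a, -ridge * b      # gradient of penalized log-likelihood
        h_aa, h_ab, h_bb = ridge, 0.0, ridge   # negative Hessian entries
        for s, y in zip(scores, labels):
            p = sigmoid(a + b * s)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * s
            h_aa += w
            h_ab += w * s
            h_bb += w * s * s
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def local_probability(score, a, b):
    """Assumed conversion: P(correct | score) ~ 1 - (1 - q) / q, clipped at 0,

    where q is the fitted probability that a hit with this score came from
    the original rather than the reshuffled database.
    """
    q = sigmoid(a + b * score)
    return max(0.0, (2.0 * q - 1.0) / q)

# Hypothetical toy scores, one original and one reshuffled hit per spectrum.
target = [0.6, 0.9, 1.3, 1.6, 2.1, 2.6, 3.0, 3.2, 3.5, 3.8, 4.0, 4.5]
decoy = [0.5, 0.7, 0.8, 1.0, 1.1, 1.2, 1.4, 1.5, 1.7, 1.9, 2.0, 2.3]

a, b = fit_logistic(target + decoy, [1] * len(target) + [0] * len(decoy))
print("cumulative FDR at score >= 2.0:", cumulative_fdr(target, decoy, 2.0))
for s in (1.0, 2.5, 4.0):
    print("score", s, "estimated P(correct):", round(local_probability(s, a, b), 3))
```

The cumulative FDR summarizes every identification above a threshold with one number, whereas the fitted curve assigns each score its own probability, which is the distinction the abstract draws between the two kinds of estimate.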