The statistical significance of protein identification results as a function of the number of protein sequences searched.

Author information

Eriksson Jan, Fenyö David

Affiliation

Swedish University of Agricultural Sciences, Box 7015, SE-750 07 Uppsala, Sweden.

Publication information

J Proteome Res. 2004 Sep-Oct;3(5):979-82. doi: 10.1021/pr0499343.

DOI:10.1021/pr0499343

PMID:15473685

Abstract

The potential for obtaining a true mass spectrometric protein identification result depends on the choice of algorithm as well as on experimental factors that influence the information content of the mass spectrometric data. Current methods can never prove definitively that a result is true, but an appropriate choice of algorithm can provide a measure of the statistical risk that a result is false, i.e., the statistical significance. We recently demonstrated an algorithm, Probity, which assigns a statistical significance to each result. For any choice of algorithm, the difficulty of obtaining statistically significant results depends on the number of protein sequences in the sequence collection searched. By simulations of random protein identifications and using the Probity algorithm, we here demonstrate explicitly how the statistical significance depends on the number of sequences searched. We also provide an example of how the practitioner's choice of taxonomic constraints influences the statistical significance.
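The dependence on the number of sequences searched can be illustrated with a toy calculation. The sketch below is not the Probity algorithm. It assumes, purely for illustration, that each random (false) match receives an independent score drawn uniformly from [0, 1), and it asks how often the best of N such random scores reaches a given threshold. The function names, the uniform score model, and the threshold value are all hypothetical.

```python
import random


def family_wise_p(p_single: float, n_sequences: int) -> float:
    """Chance that at least one of n independent random matches
    scores as well as the observed one, given the chance p_single
    for a single random match."""
    return 1.0 - (1.0 - p_single) ** n_sequences


def empirical_p(threshold: float, n_sequences: int,
                trials: int = 1000, seed: int = 0) -> float:
    """Simulate random identifications under the assumed toy model:
    each trial draws n uniform scores and keeps the best one. Returns
    the fraction of trials whose best random score reaches threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        best = max(rng.random() for _ in range(n_sequences))
        if best >= threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    threshold = 0.999             # hypothetical score of the top candidate
    p_single = 1.0 - threshold    # uniform scores: P(score >= t) = 1 - t
    for n in (10, 100, 1000):
        print(n, family_wise_p(p_single, n), empirical_p(threshold, n))
```

Under these assumptions, the same score becomes less significant as the collection grows, which is one way to read the effect of taxonomic constraints: restricting the search to a smaller taxon reduces N and therefore the chance that a random match reaches the same score.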
