EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
Mol Cell Proteomics. 2011 Sep;10(9):M111.008490. doi: 10.1074/mcp.M111.008490. Epub 2011 Jun 23.
In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data, we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database as of November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called the logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier. Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnology Information nr database (NCBI nr), and Ensembl) with respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more deleted IPI entries than the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB: two releases per year were used, starting from 2005.
This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.
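The status-based ("logical") mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each source database record exposes a status (active, replaced, or deleted) and, for replaced entries, a successor identifier. The `SNAPSHOT` table, the identifiers in it, and the function names `resolve` and `status_fractions` are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical snapshot of a source database: identifier -> (status, successor).
# These records are invented for illustration; they are not real accessions.
SNAPSHOT = {
    "ID_A": ("active", None),
    "ID_B": ("replaced", "ID_A"),
    "ID_C": ("deleted", None),
    "ID_D": ("replaced", "ID_B"),
}


def resolve(identifier, snapshot):
    """Follow replacement links and return (final_status, current_identifier).

    final_status is "active", "replaced", "deleted", or "unknown"
    (identifier missing from the snapshot, or a cyclic replacement chain).
    """
    seen = set()
    current = identifier
    was_replaced = False
    while True:
        if current in seen or current not in snapshot:
            return ("unknown", None)
        seen.add(current)
        status, successor = snapshot[current]
        if status == "active":
            return ("replaced" if was_replaced else "active", current)
        if status == "deleted":
            return ("deleted", None)
        # Status is "replaced": continue with the successor entry.
        was_replaced = True
        current = successor


def status_fractions(identifiers, snapshot):
    """Return the fraction of submitted identifiers in each final status."""
    counts = Counter(resolve(i, snapshot)[0] for i in identifiers)
    total = len(identifiers)
    return {status: n / total for status, n in counts.items()}


if __name__ == "__main__":
    submitted = ["ID_A", "ID_B", "ID_C", "ID_D"]
    for identifier in submitted:
        print(identifier, resolve(identifier, SNAPSHOT))
    print(status_fractions(submitted, SNAPSHOT))
```

Under these assumptions, an identifier counts as deleted only when its chain ends in a deleted record. A sequence-identity approach such as PICR would instead ask whether any currently active entry has exactly the same sequence, which is one reason the two methods can report different proportions of deleted entries.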