Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354 Freising, Germany.
J Proteome Res. 2013 Jun 7;12(6):2386-98. doi: 10.1021/pr400215r. Epub 2013 May 14.
Protein sequence databases are indispensable tools for life science research, including mass spectrometry (MS)-based proteomics. In current database construction processes, sequence similarity clustering is used to reduce redundancies in the source data. Although powerful, this approach ignores the peptide-centric nature of proteomic data and the fact that MS is able to distinguish similar sequences. Therefore, we introduce an approach that structures the protein sequence space at the peptide level using theoretical and empirical information from large-scale proteomic data to generate a mass spectrometry-centric protein sequence database (MScDB). The core modules of MScDB are an in silico proteolytic digest and a peptide-centric clustering algorithm that groups protein sequences that are indistinguishable by mass spectrometry. Analysis of various MScDB use cases against five complex human proteomes resulted in 69 peptide identifications not present in UniProtKB as well as 79 putative single amino acid polymorphisms. MScDB retains ~99% of the identifications in comparison to common databases despite a 3-48% increase in the theoretical peptide search space (but comparable protein sequence space). In addition, MScDB enables cross-species applications such as human/mouse graft models, and our results suggest that the uncertainty in protein assignments to one species can be smaller than 20%.
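The two core modules named in the abstract (an in silico proteolytic digest followed by peptide-centric clustering) can be illustrated with a minimal Python sketch. This is not the MScDB implementation. It assumes a simple trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, a hypothetical minimum peptide length of 6, and it treats two proteins as indistinguishable only when their filtered peptide sets are identical. The function names and the toy sequences `P1` to `P3` are invented for illustration.

```python
import re
from collections import defaultdict


def tryptic_digest(sequence, min_length=6):
    """Return the set of tryptic peptides of a protein sequence.

    Assumed rule: cleave C-terminal to K or R unless followed by P,
    no missed cleavages. Peptides shorter than min_length are dropped
    because they are rarely observed or identified by MS.
    """
    # Zero-width split points: after K/R, not before P (Python 3.7+).
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return {p for p in pieces if len(p) >= min_length}


def cluster_indistinguishable(proteins, min_length=6):
    """Group protein names whose filtered peptide sets are identical.

    Assumption for this sketch: identical peptide sets means the
    proteins cannot be told apart at the peptide level.
    """
    groups = defaultdict(list)
    for name, seq in proteins.items():
        key = frozenset(tryptic_digest(seq, min_length))
        groups[key].append(name)
    return sorted(sorted(names) for names in groups.values())


# Hypothetical toy sequences, not taken from any real database.
PROTEINS = {
    "P1": "MKPLAAGTVRSSDEEK",
    "P2": "MKPLAAGTVRSSDEEKAAA",  # extra tail is too short to be observed
    "P3": "MKPLAAGTVRSSDAEK",     # single amino acid change (E to A)
}

if __name__ == "__main__":
    for cluster in cluster_indistinguishable(PROTEINS):
        print(cluster)
```

In this toy input, `P1` and `P2` fall into one cluster because the only difference between them is a tail peptide below the length cutoff, whereas `P3` stays separate because its substituted residue produces a distinct peptide. A sequence similarity clustering at a high identity threshold would likely merge all three, which is the limitation the abstract describes.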