Ruhr University Bochum, Medical Faculty, Medizinisches Proteom-Center, 44801 Bochum, Germany.
Ruhr University Bochum, Center for Protein Diagnostics (PRODI), Medical Proteome Analysis, 44801 Bochum, Germany.
J Proteome Res. 2021 Apr 2;20(4):2145-2150. doi: 10.1021/acs.jproteome.0c00967. Epub 2021 Mar 16.
Protein sequence databases play a crucial role in most currently applied mass-spectrometry-based proteomics workflows. UniProtKB is one of the major sources, as it combines the information of several smaller databases and enriches the entries with additional biological information. For identifying peptides in a sample from tandem mass spectra generated by data-dependent acquisition, protein sequence databases provide the basis for most spectrum identification search engines. In addition, targeted proteomics approaches such as selected reaction monitoring (SRM) and parallel reaction monitoring (PRM) require knowledge of the peptide sequences, their masses, and whether they are unique to a protein. Because most bottom-up proteomics approaches use trypsin to cleave the proteins in a sample, the tryptic peptides contained in a protein database are of great interest. We present a database, called MaCPepDB (mass-centric peptide database), that consists of the complete tryptic digest of the Swiss-Prot and TrEMBL parts of UniProtKB. The database is designed to support two kinds of queries: queries by peptide sequence, which return the connected proteins and thus show whether a peptide is unique, and queries by mass, for specific peptide masses or precursor masses of MS/MS spectra. Furthermore, posttranslational modifications can be considered in a query, as well as different mass deviations for posttranslational modifications. Hence a sequence query can be used, for example, to check in which proteins of the UniProt database a tryptic peptide occurs, and a mass query can be used to find possibly interfering peptides in PRM/SRM experiments. The complete database currently contains 5 939 244 990 peptides from 185 561 610 proteins (UniProt version 2020_03), and a single query usually takes less than 1 s. For easy exploration of the data, a web interface was developed.
A REST application programming interface (API) for programmatic and workflow access is also available at https://macpepdb.mpc.rub.de.
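The two query types described above can be illustrated with a small, purely local sketch. It is not the MaCPepDB implementation and does not call its REST API. It assumes the common trypsin rule (cleave after K or R unless the next residue is P), uses standard monoisotopic residue masses, and ignores posttranslational modifications. The function names, the `missed_cleavages` and `tolerance_ppm` parameters, and the accessions `P1` to `P3` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right

# Standard monoisotopic residue masses in Da (unmodified amino acids).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per peptide for the free termini


def tryptic_digest(sequence, missed_cleavages=0):
    """Split a protein after K/R, except before P (assumed cleavage rule)."""
    cuts = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            cuts.append(i + 1)
    cuts.append(len(sequence))
    peptides = []
    for start in range(len(cuts) - 1):
        last = min(start + 2 + missed_cleavages, len(cuts))
        for end in range(start + 1, last):
            peptide = sequence[cuts[start]:cuts[end]]
            if peptide:  # skip the empty peptide of an empty sequence
                peptides.append(peptide)
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified, uncharged peptide in Da."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def build_index(proteins):
    """Return (peptide -> accessions, list of (mass, peptide) sorted by mass)."""
    peptide_to_proteins = {}
    for accession, sequence in proteins.items():
        for peptide in tryptic_digest(sequence):
            peptide_to_proteins.setdefault(peptide, set()).add(accession)
    by_mass = sorted((peptide_mass(p), p) for p in peptide_to_proteins)
    return peptide_to_proteins, by_mass


def is_unique(peptide_to_proteins, peptide):
    """Sequence query: a peptide is unique if exactly one protein contains it."""
    return len(peptide_to_proteins.get(peptide, ())) == 1


def query_mass(by_mass, target, tolerance_ppm=5.0):
    """Mass query: all peptides within +/- tolerance_ppm of the target mass."""
    delta = target * tolerance_ppm / 1e6
    masses = [mass for mass, _ in by_mass]
    low = bisect_left(masses, target - delta)
    high = bisect_right(masses, target + delta)
    return [peptide for _, peptide in by_mass[low:high]]


# Toy proteins with hypothetical accessions and sequences.
proteins = {"P1": "MKAGR", "P2": "AGRLLK", "P3": "ILK"}
index, by_mass = build_index(proteins)

print(is_unique(index, "AGR"))                    # sequence query
print(query_mass(by_mass, peptide_mass("LLK")))   # mass query
```

In this toy data the peptide AGR occurs in two proteins, so a sequence query reports it as not unique, while a mass query for LLK also returns the isobaric ILK, the kind of potentially interfering peptide a PRM/SRM assay would need to account for. A sorted in-memory list is used here only to keep the sketch short; it says nothing about how the real database is partitioned or indexed.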