Lux Dominik, Marcus-Alic Katrin, Eisenacher Martin, Uszkoreit Julian
Ruhr University Bochum, Medical Faculty, Medizinisches Proteom-Center, Gesundheitscampus 4, 44801 Bochum, Germany.
Ruhr University Bochum, Medical Faculty, Center for Protein Diagnostics (PRODI), Gesundheitscampus 4, 44801 Bochum, Germany.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae671.
Due to computational resource limitations, in mass spectrometry-based proteomics only a limited set of peptide sequences is matched against measured spectra. We present an approach that represents proteins as graphs and covers not only the canonical sequences but also known isoforms and annotated amino acid variations, e.g. originating from genomic mutations, as well as further common protein sequence features contained in UniProtKB or other protein databases. Our C++ and Python implementation enables a comprehensive characterization of the peptide search space that, for the first time, encompasses all available annotations in a protein database (in combination more than $10^{200}$ possibilities). Additionally, it can be used to quickly extract the relevant subset of the search space for peptide-spectrum matching, e.g. by filtering on peptide mass. We demonstrate the advantages and innovative findings of our implementation compared to previous workflows by re-analysing publicly available datasets.
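The idea of a protein graph with mass-based filtering can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy graph, the node names, and the functions `count_paths` and `peptides_in_mass_window` are hypothetical and chosen only for illustration. The sketch assumes a directed acyclic graph in which each node carries a stretch of amino acids and an annotated variant is an alternative node, and it ignores digestion, isoforms, and modifications.

```python
from functools import lru_cache

# Monoisotopic residue masses in Dalton for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide (terminal H and OH)

# Hypothetical toy graph: node id -> (amino acids, successor node ids).
# It encodes the sequence PEPTIDE with one annotated variant, T -> S.
GRAPH = {
    "start": ("", ["n1"]),
    "n1": ("PEP", ["n2", "n2v"]),
    "n2": ("T", ["n3"]),      # canonical residue
    "n2v": ("S", ["n3"]),     # variant residue
    "n3": ("IDE", ["end"]),
    "end": ("", []),
}


def count_paths(graph, start="start", end="end"):
    """Count start-to-end paths without enumerating them (memoized DP)."""
    @lru_cache(maxsize=None)
    def count(node):
        if node == end:
            return 1
        return sum(count(nxt) for nxt in graph[node][1])
    return count(start)


def peptides_in_mass_window(graph, lo, hi, start="start", end="end"):
    """Return (sequence, mass) for all paths whose mass lies in [lo, hi]."""
    results = []

    def walk(node, seq, mass):
        label, successors = graph[node]
        seq += label
        mass += sum(RESIDUE_MASS[aa] for aa in label)
        if mass > hi:
            return  # prune: residue masses are positive, so mass only grows
        if node == end:
            if mass >= lo:
                results.append((seq, round(mass, 4)))
            return
        for nxt in successors:
            walk(nxt, seq, mass)

    walk(start, "", WATER)
    return results


if __name__ == "__main__":
    print(count_paths(GRAPH))
    print(peptides_in_mass_window(GRAPH, 780.0, 790.0))
```

Counting paths by dynamic programming shows why the size of the search space can be characterized even when it is far too large to enumerate, while pruning on the upper mass bound lets a traversal visit only the part of the graph relevant to a given query.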