Horokhovskyi Yehor, Roetschke Hanna P, Cormican John A, Pašen Martin, Garazhian Sina, Mishto Michele, Liepe Juliane
Research Group of Quantitative and Systems Biology, Max-Planck-Institute for Multidisciplinary Sciences (MPI-NAT), Göttingen, Germany.
Research Group of Quantitative and Systems Biology, Max-Planck-Institute for Multidisciplinary Sciences (MPI-NAT), Göttingen, Germany; Centre for Inflammation Biology and Cancer Immunology, King's College London, London, United Kingdom; Peter Gorer Department of Immunobiology, King's College London, London, United Kingdom; Research Group of Molecular Immunology, Francis Crick Institute, London, United Kingdom.
Mol Cell Proteomics. 2025 Jul 21;24(9):101039. doi: 10.1016/j.mcpro.2025.101039.
The discovery of antigenic noncanonical epitopes and novel proteins has therapeutic applications and is predominantly carried out via mass spectrometry. Mass spectrometry searches should rely on a well-characterized proteogenomic search space, yet the size of that space is largely unknown for antigenic noncanonical peptides and novel proteins, which could affect their identification. To address these issues, we develop an automated workflow comprising Sequoia, which creates RNA sequencing-informed and exhaustive sequence search spaces for various noncanonical peptide origins, and SPIsnake, which pre-filters and explores the sequence search space before mass spectrometry searches. We apply the workflow to characterize the exact sizes of tryptic and nonspecific peptide sequence search spaces under a variety of definitions, their reduction when RNA expression is used, their inflation by post-translational modifications, and the frequency with which peptide sequences multimap to different noncanonical origins. Furthermore, we explore the application of Sequoia and SPIsnake to HLA-I immunopeptidomes, thereby rescuing sensitivity in peptide identification when confronted with inflated search spaces. Taken together, Sequoia and SPIsnake pave the way for the informed development of methods for large-scale exhaustive proteogenomic discovery by exposing the consequences of database size inflation and of ambiguity in peptide and protein sequence identification.
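To make the distinction between tryptic and nonspecific search spaces concrete, the following minimal sketch counts the unique peptide sequences that each digestion rule yields from a single sequence. It is not code from Sequoia or SPIsnake. The function names, the toy sequence, the simple trypsin rule (cleave after K or R unless followed by P, no missed cleavages), and the length bounds are all assumptions chosen for illustration.

```python
def tryptic_peptides(protein, min_len=5, max_len=30):
    """Unique fully tryptic peptides with no missed cleavages.

    Assumed rule: cleave after K or R unless the next residue is P.
    """
    peptides, start = set(), 0
    for i, aa in enumerate(protein):
        at_end = i == len(protein) - 1
        if at_end or (aa in "KR" and protein[i + 1] != "P"):
            pep = protein[start:i + 1]
            if min_len <= len(pep) <= max_len:
                peptides.add(pep)
            start = i + 1
    return peptides


def nonspecific_peptides(protein, min_len=8, max_len=12):
    """Unique substrings of every allowed length (no cleavage rule)."""
    return {
        protein[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(protein) - n + 1)
    }


if __name__ == "__main__":
    toy = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical toy sequence
    # Use the same length window for both rules so the counts are comparable.
    n_tryptic = len(tryptic_peptides(toy, 5, 12))
    n_nonspecific = len(nonspecific_peptides(toy, 5, 12))
    print("tryptic:", n_tryptic, "nonspecific:", n_nonspecific)
```

With a shared length window, every tryptic peptide is also a nonspecific one, so the nonspecific set can only be equal or larger. In practice it grows roughly with sequence length times the number of allowed peptide lengths, which is one reason nonspecific searches, such as those used for HLA-I immunopeptidomes, face much larger search spaces.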