Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA.
Talus Bioscience, Seattle, USA.
Nat Commun. 2024 Jul 30;15(1):6427. doi: 10.1038/s41467-024-49731-x.
A fundamental challenge in mass spectrometry-based proteomics is the identification of the peptide that generated each acquired tandem mass spectrum. Approaches that leverage known peptide sequence databases cannot detect unexpected peptides and can be impractical or impossible to apply in some settings. Thus, the ability to assign peptide sequences to tandem mass spectra without prior information (de novo peptide sequencing) is valuable for tasks including antibody sequencing, immunopeptidomics, and metaproteomics. Although many methods have been developed to address this problem, it remains an outstanding challenge in part due to the difficulty of modeling the irregular data structure of tandem mass spectra. Here, we describe Casanovo, a machine learning model that uses a transformer neural network architecture to translate the sequence of peaks in a tandem mass spectrum into the sequence of amino acids that comprise the generating peptide. We train a Casanovo model from 30 million labeled spectra and demonstrate that the model outperforms several state-of-the-art methods on a cross-species benchmark dataset. We also develop a version of Casanovo that is fine-tuned for non-enzymatic peptides. Finally, we demonstrate that Casanovo's superior performance improves the analysis of immunopeptidomics and metaproteomics experiments and allows us to delve deeper into the dark proteome.