Center for Biomedical Computing, Korea Institute of Science and Technology Information, Daejeon, Republic of Korea.
PLoS Comput Biol. 2024 Feb 28;20(2):e1011892. doi: 10.1371/journal.pcbi.1011892. eCollection 2024 Feb.
Identifying peptide sequences is a central task in proteomics. De novo sequencing methods have been widely used for this purpose, and numerous tools have been proposed over the past two decades. Recently, deep learning approaches have been introduced for de novo sequencing. Previous methods focused on encoding tandem mass spectra and predicting the peptide sequence starting from the first amino acid. However, because b-ion (or a- or c-ion) and y-ion (or x- or z-ion) fragments coexist in a tandem mass spectrum, the peptide sequence can be predicted not only from the first amino acid but also from the last. It is therefore essential to predict peptide sequences bidirectionally. Our approach, called NovoB, uses a Transformer model to predict peptide sequences bidirectionally, starting from both the first and the last amino acid. Compared with Casanovo, our method improved the average peptide-level accuracy by approximately 9.8% across all species.
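The bidirectional argument can be illustrated with a minimal sketch. It is not the NovoB model, only a toy showing why a spectrum carries sequence information from both ends. It assumes singly charged, unmodified fragments and a noise-free ladder. The function names (`b_ions`, `y_ions`, `read_ladder`), the example peptide, and the tolerance are hypothetical choices made for illustration. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values, small subset only).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "E": 129.04259, "R": 156.10111,
}
PROTON = 1.007276   # mass of a proton, Da
WATER = 18.010565   # mass of H2O, Da


def b_ions(peptide):
    """Singly charged b-ion m/z values: prefixes growing from the N-terminus."""
    total, ladder = 0.0, []
    for aa in peptide[:-1]:
        total += RESIDUE_MASS[aa]
        ladder.append(total + PROTON)
    return ladder


def y_ions(peptide):
    """Singly charged y-ion m/z values: suffixes growing from the C-terminus."""
    total, ladder = 0.0, []
    for aa in reversed(peptide[1:]):
        total += RESIDUE_MASS[aa]
        ladder.append(total + WATER + PROTON)
    return ladder


def read_ladder(mzs, tol=0.01):
    """Name the residue behind each consecutive mass difference in a ladder.

    Assumption: an idealized ladder with no missing or noisy peaks.
    Unmatched differences are reported as '?'.
    """
    residues = []
    for low, high in zip(mzs, mzs[1:]):
        delta = high - low
        matches = [aa for aa, m in RESIDUE_MASS.items() if abs(m - delta) <= tol]
        residues.append(matches[0] if matches else "?")
    return "".join(residues)


if __name__ == "__main__":
    peptide = "SAVEK"  # hypothetical example peptide
    print(read_ladder(b_ions(peptide)))  # residues read first-to-last
    print(read_ladder(y_ions(peptide)))  # the same residues read last-to-first
```

For the example peptide, the b-ion ladder yields the internal residues in forward order and the y-ion ladder yields them in reverse order. This is the redundancy that a bidirectional decoder can exploit. Real spectra have missing peaks, noise, and multiple charge states, which this sketch ignores.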