Division of Infection Medicine, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden.
Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France.
PLoS Comput Biol. 2023 Jan 20;19(1):e1010457. doi: 10.1371/journal.pcbi.1010457. eCollection 2023 Jan.
Generating and analyzing overlapping peptides through multienzymatic digestion is an efficient procedure for de novo protein sequencing using bottom-up mass spectrometry (MS). Despite improved instrumentation and software, de novo MS data analysis remains challenging. In recent years, deep learning models have represented a performance breakthrough. Incorporating that technology into de novo protein sequencing workflows requires machine-learning models capable of handling highly diverse MS data. In this study, we analyzed the requirements for assembling such generalizable deep learning models by systematically varying the composition and size of the training set. We assessed the generated models' performance using two test sets composed of peptides originating from the multienzyme digestion of samples from various species. The peptide recall values on the test sets showed that the deep learning models generated from a collection of peptides with highly diverse N- and C-termini generalized 76% better than the termini-restricted ones. Moreover, expanding the training set by adding peptides from the multienzymatic digestion, with five proteases, of samples from several species led to a 2-3 fold generalizability gain. Furthermore, we tested the applicability of these multienzyme deep learning (MEM) models by fully de novo sequencing the heavy and light monomeric chains of five commercial antibodies (mAbs). MEMs extracted over 10000 matching and overlapping peptides across mAb samples digested with six different proteases, achieving 100% sequence coverage for eight of the ten polypeptide chains. We anticipate that the MEMs' demonstrated improvements to de novo analysis will positively impact several applications, such as the analysis of samples of high complexity or unknown nature and the field of peptidomics.
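The sequence coverage figure reported above can be illustrated with a minimal sketch. It assumes coverage is the fraction of residues in a chain that fall inside at least one exactly matching peptide; the function name `sequence_coverage` and the toy sequences are hypothetical and are not taken from the study, whose actual matching and assembly procedure is not described in this abstract.

```python
def sequence_coverage(protein: str, peptides: list[str]) -> float:
    """Return the fraction of residues in `protein` covered by at least
    one exact occurrence of any peptide (illustrative definition only)."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        # Mark every occurrence of the peptide, including overlapping ones.
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)


# Toy chain and overlapping peptides (made-up sequences, not study data).
chain = "EVQLVESGGGLVQPGGSLRLSCAAS"
peptides = ["EVQLVESGGGL", "SGGGLVQPGGSLR", "LRLSCAAS"]
print(f"{sequence_coverage(chain, peptides):.0%}")
```

In this toy case the three peptides overlap end to end, so every residue of the chain is covered. Overlaps of this kind are what make peptides from several proteases useful for reconstructing a full chain.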