Biomedical Center, Protein Analysis Unit, Faculty of Medicine, Ludwig-Maximilians-Universität München, Großhaderner Strasse 9, 82152 Planegg-Martinsried, Germany.
Institute of Stem Cell Research, Helmholtz Center Munich, German Research Center for Environmental Health, 85764 Munich, Germany.
J Proteome Res. 2021 Jul 2;20(7):3749-3757. doi: 10.1021/acs.jproteome.1c00346. Epub 2021 Jun 17.
Trypsin is one of the most important and widely used proteolytic enzymes in mass spectrometry (MS)-based proteomic research. It exclusively cleaves peptide bonds on the C-terminal side of lysine and arginine residues. However, cleavage is also affected by several factors, including the specific surrounding amino acids, which results in frequent incomplete proteolysis and subsequent problems in peptide identification and quantification. Accurate annotation of missed cleavages is therefore crucial to database searching in MS analysis. Here, we present deep-learning predicting missed cleavages (dpMC), a novel algorithm for the prediction of missed trypsin cleavage sites. The algorithm predicts missed cleavages with very high accuracy: the areas under the curve (AUCs) for cross-validation and holdout testing are above 0.99, and the mean F1 score and Matthews correlation coefficient (MCC) are 0.9677 and 0.9349, respectively. We tested the algorithm on data sets from different species and different experimental conditions, and it outperforms other currently available prediction methods. Coupled with propensity and motif analysis, the method also provides better insight into the detailed rules of trypsin cleavage. Moreover, it can be integrated into database searching in MS analysis to identify and quantify mass spectra effectively and efficiently.
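To make the notion of a missed cleavage concrete, the following is a minimal sketch of an in silico tryptic digest. It is not part of dpMC; it only applies the basic rule stated above (cleavage after lysine, K, or arginine, R) and ignores all sequence-context effects, which are exactly what a predictor such as dpMC is meant to capture. The function names `tryptic_sites` and `digest` and the parameter `max_missed` are hypothetical, chosen for illustration.

```python
from typing import List


def tryptic_sites(protein: str) -> List[int]:
    """Indices i where the bond after protein[i] is a candidate cleavage site.

    Simplified rule (assumption): cleave after every K or R, no context effects.
    The last residue is skipped because there is no bond after it.
    """
    return [i for i, aa in enumerate(protein[:-1]) if aa in "KR"]


def digest(protein: str, max_missed: int = 0) -> List[str]:
    """Peptides from a full digest, allowing up to max_missed missed cleavages."""
    if not protein:
        return []
    # Fragment boundaries: sequence start, the position after each K/R, sequence end.
    bounds = [0] + [i + 1 for i in tryptic_sites(protein)] + [len(protein)]
    peptides = []
    for start in range(len(bounds) - 1):
        for missed in range(max_missed + 1):
            end = start + 1 + missed
            if end >= len(bounds):
                break
            # A peptide spanning `missed` internal K/R sites.
            peptides.append(protein[bounds[start]:bounds[end]])
    return peptides


if __name__ == "__main__":
    # Toy sequence, not taken from any data set in the paper.
    print(digest("AAKGGRCC", max_missed=1))
```

With `max_missed=0` the toy sequence yields only fully cleaved peptides; raising it adds peptides that retain an internal K or R, which is the search-space cost that accurate missed-cleavage prediction aims to reduce.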
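For readers unfamiliar with the two summary metrics quoted in the abstract, the sketch below computes the F1 score and the Matthews correlation coefficient (MCC) from the four confusion-matrix counts using their standard definitions. It is an illustration only: the function names are hypothetical, and the counts in the usage lines are made up, not results from the paper.

```python
import math


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 if the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 if any marginal sum is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration: a "positive" is a missed cleavage site.
    tp, tn, fp, fn = 90, 85, 15, 10
    print(f"F1  = {f1_score(tp, fp, fn):.4f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.4f}")
```

Unlike F1, MCC uses all four counts, including true negatives, so it stays informative when cleaved and missed sites are unevenly represented.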