Skolkovo Institute of Science and Technology, Moscow 121205, Russia.
A.A. Kharkevich Institute for Information Transmission Problems, Moscow 127051, Russia.
Int J Mol Sci. 2023 Jun 28;24(13):10761. doi: 10.3390/ijms241310761.
The importance of 3D protein structure in proteolytic processing is well known. However, despite the plethora of existing methods for predicting proteolytic sites, only a few of them utilize the structural features of potential substrates as predictors. Moreover, to our knowledge, there is currently no method available for predicting the structural susceptibility of protein regions to proteolysis. We developed such a method using data from CutDB, a database that contains experimentally verified proteolytic events. For prediction, we utilized structural features that have been shown to influence proteolysis in earlier studies, such as solvent accessibility, secondary structure, and temperature factor. Additionally, we introduced new structural features, including length of protruded loops and flexibility of protein termini. To maximize the prediction quality of the method, we carefully curated the training set, selected an appropriate machine learning method, and sampled negative examples to determine the optimal positive-to-negative class size ratio. We demonstrated that combining our method with models of protease primary specificity can outperform existing bioinformatics methods for the prediction of proteolytic sites. We also discussed the possibility of utilizing this method for bioinformatics prediction of other post-translational modifications.
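The negative-sampling step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that negatives are undersampled at random to reach a chosen number of negatives per positive, and that the structural score and a primary-specificity score are combined by simple multiplication. Neither detail is stated in the abstract. The names `sample_negatives` and `combined_score` and the toy site labels are hypothetical and used only for illustration.

```python
import random


def sample_negatives(positives, negatives, ratio, seed=0):
    """Undersample negatives to `ratio` negatives per positive.

    Assumption: plain random undersampling; the abstract does not
    specify the sampling scheme actually used.
    """
    rng = random.Random(seed)
    # Cap the request at the number of negatives that exist.
    n_neg = min(len(negatives), int(round(ratio * len(positives))))
    return list(positives), rng.sample(list(negatives), n_neg)


def combined_score(structural_prob, specificity_prob):
    """Combine two per-site probabilities.

    Assumption: the scores are treated as independent and multiplied;
    this is an illustrative choice, not the published rule.
    """
    return structural_prob * specificity_prob


# Toy data: hypothetical site labels, not taken from CutDB.
positives = [f"cleaved_{i}" for i in range(10)]
negatives = [f"uncleaved_{i}" for i in range(200)]

# Build one training set per candidate class-size ratio.
for ratio in (1, 2, 5, 10):
    pos, neg = sample_negatives(positives, negatives, ratio)
    print(ratio, len(pos), len(neg))
```

In practice, each candidate ratio would be used to train and cross-validate a classifier, and the ratio giving the best validation performance would be kept.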