Bioinformatics Center, Kyoto University, Uji, Kyoto, Japan.
PLoS One. 2011 May 3;6(5):e19035. doi: 10.1371/journal.pone.0019035.
Calpain, an intracellular Ca²⁺-dependent cysteine protease, is known to play a role in a wide range of metabolic pathways through limited proteolysis of its substrates. However, only a limited number of these substrates are currently known, and the exact mechanism of substrate recognition and cleavage by calpain remains largely unknown. While previous research has successfully applied standard machine-learning algorithms to accurately predict substrate cleavage by other, similar proteases, this approach does not extend well to calpain, possibly because of its particular mode of proteolytic action and the limited amount of experimental data. Using Multiple Kernel Learning, a recent extension of the classic Support Vector Machine framework, we were able to train complex models based on rich, heterogeneous feature sets, leading to significantly improved prediction quality (6% over the highest AUC score produced by state-of-the-art methods). In addition to producing a stronger machine-learning model for the prediction of calpain cleavage, we were able to highlight the importance and role of each feature of substrate sequences in defining specificity: primary sequence, secondary structure and solvent accessibility. Most notably, we showed that significant specificity differences exist across calpain sub-types, despite previous assumptions to the contrary. Prediction accuracy was further validated using, as an unbiased test set, mutated sequences of calpastatin (the endogenous inhibitor of calpain) modified to no longer block calpain's proteolytic action. An online implementation of our prediction tool is available at http://calpain.org.
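To make the idea of combining heterogeneous feature sets concrete, the following is a minimal sketch of the kernel combination that underlies Multiple Kernel Learning: one base kernel per feature type (primary sequence, secondary structure, solvent accessibility), merged as a weighted sum. It is not the authors' implementation. The toy windows, their labels, the k-mer spectrum kernel and the kernel-target alignment heuristic used to set the weights are all assumptions chosen for illustration; a full MKL solver would instead learn the weights jointly with the Support Vector Machine.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy data (NOT from the paper): each sample is a window with
# (amino-acid sequence, secondary structure, solvent accessibility, label).
# Label +1 = cleaved, -1 = not cleaved.
samples = [
    ("LKTPSAGR", "CCCCHHHH", "eeeebbee", +1),
    ("LRTPSLGK", "CCCCHHHC", "eeeebbeb", +1),
    ("AVILMFWV", "EEEEEEEE", "bbbbbbbb", -1),
    ("GVILAFWI", "EEEEEECC", "bbbbbbee", -1),
]
labels = [s[3] for s in samples]


def spectrum_kernel(a, b, k=2):
    """Simple string kernel: dot product of k-mer count vectors."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return float(sum(ca[m] * cb[m] for m in ca))


def gram(items, kernel):
    """Cosine-normalised Gram matrix, so every base kernel has unit diagonal."""
    raw = [[kernel(x, y) for y in items] for x in items]
    n = len(items)
    return [
        [
            raw[i][j] / sqrt(raw[i][i] * raw[j][j])
            if raw[i][i] > 0 and raw[j][j] > 0 else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def alignment(K, y):
    """Kernel-target alignment <K, yy^T> / (||K||_F * ||yy^T||_F), y in {-1, +1}."""
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    norm_k = sqrt(sum(v * v for row in K for v in row))
    return num / (norm_k * n) if norm_k else 0.0  # ||yy^T||_F equals n


def combine(kernels, y):
    """Weighted sum of base kernels; weights are non-negative and sum to 1."""
    scores = [max(alignment(K, y), 0.0) for K in kernels]
    total = sum(scores)
    if total > 0:
        weights = [s / total for s in scores]
    else:
        weights = [1.0 / len(kernels)] * len(kernels)  # uniform fallback
    n = len(y)
    combined = [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
    return weights, combined


# One base kernel per feature type.
base_kernels = [
    gram([s[0] for s in samples], spectrum_kernel),  # primary sequence
    gram([s[1] for s in samples], spectrum_kernel),  # secondary structure
    gram([s[2] for s in samples], spectrum_kernel),  # solvent accessibility
]
weights, K_combined = combine(base_kernels, labels)

for name, w in zip(["sequence", "structure", "accessibility"], weights):
    print(f"{name}: weight {w:.3f}")
```

The combined matrix `K_combined` could then be passed to any SVM that accepts a precomputed kernel. In a sketch like this, the relative weights give a rough indication of how much each feature type contributes, which is the kind of per-feature reading the abstract describes.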