Bioinformatics Center, Kyoto University, Uji, Kyoto 611-0011, Japan.
Brief Bioinform. 2012 May;13(3):337-49. doi: 10.1093/bib/bbr059. Epub 2011 Dec 2.
A fundamental component of systems biology, proteolytic cleavage is involved in nearly all aspects of cellular activity, from gene regulation to cell life-cycle regulation. Current sequencing technologies have made it possible to compile large amounts of cleavage data and have brought a greater understanding of the underlying protein interactions. However, the practical impossibility of exhaustively retrieving substrate sequences through experimentation alone has long highlighted the need for efficient computational prediction methods. Such methods must be able to quickly mark substrate candidates and putative cleavage sites for further analysis. The available methods and their expected reliability depend heavily on the type and complexity of proteolytic action, as well as on the availability of well-labelled experimental data sets, factors that vary greatly across enzyme families. In this review, we give a brief overview of the general issues and challenges in cleavage prediction, followed by a more in-depth presentation of major techniques and implementations, with a focus on two particular families of cysteine proteases: caspases and calpains. Through their differences in proteolytic specificity (high for caspases, broader for calpains) and in data availability (much lower for calpains), we aim to illustrate the strengths and limitations of techniques ranging from position-based matrices and decision trees to more flexible machine-learning methods such as hidden Markov models and support vector machines. In addition to a technical overview of each family of algorithms, we provide elements of evaluation and performance comparison across methods.