Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia; Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia.
J Theor Biol. 2018 Apr 14;443:125-137. doi: 10.1016/j.jtbi.2018.01.023. Epub 2018 Feb 1.
Determining the catalytic residues in an enzyme is critical to understanding the relationship between protein sequence, structure, and function, and to enhancing our ability to design novel enzymes and their inhibitors. Although many enzymes have been sequenced and their primary and tertiary structures determined, experimental methods for enzyme functional characterization lag behind. Because experimental methods for identifying catalytic residues are resource- and labor-intensive, computational approaches are highly desirable: they can complement experimental studies in identifying catalytic residues and help bridge the sequence-structure-function gap. In this study, we describe a new computational method called PREvaIL for predicting enzyme catalytic residues. The method was developed by leveraging a comprehensive set of informative features extracted at multiple levels, including sequence, structure, and residue-contact network, within a random forest machine-learning framework. Extensive benchmarking experiments on eight different datasets, based on 10-fold cross-validation and independent tests, as well as side-by-side performance comparisons with seven modern sequence- and structure-based methods, showed that PREvaIL achieved competitive predictive performance, with the area under the receiver operating characteristic curve ranging from 0.896 to 0.973 and the area under the precision-recall curve ranging from 0.294 to 0.523. We demonstrated that this method captures useful signals arising from the different levels, and that combining these distinct feature types significantly improves the performance of catalytic residue prediction. We believe that this new method can serve as a valuable tool both for understanding the complex sequence-structure-function relationships of proteins and for facilitating the characterization of novel enzymes lacking functional annotations.
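The two metrics reported above can differ widely on the same predictions, which is typical when positives (here, catalytic residues) are a small fraction of all residues. The following is a minimal sketch of how each metric is computed from per-residue scores. It is not the PREvaIL implementation: the function names, the toy labels, and the toy scores are illustrative assumptions, and the area under the precision-recall curve is approximated by average precision.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision values measured at the
    rank of each true positive. Assumes no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    true_pos = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos


# Hypothetical toy data: 10 residues, 2 of them catalytic (label 1).
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

auroc = roc_auc(labels, scores)
auprc = average_precision(labels, scores)
print(f"AUROC = {auroc:.3f}, average precision = {auprc:.3f}")
```

In this toy case, two non-catalytic residues ranked above the second catalytic residue lower the ROC area only modestly but halve the precision at that rank, so the precision-recall figure is the lower of the two.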