Department of Chemical Science and Engineering, Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan.
Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin, Sakyo-ku, Kyoto 606-8507, Japan.
J Chem Inf Model. 2020 Mar 23;60(3):1833-1843. doi: 10.1021/acs.jcim.9b00877. Epub 2020 Feb 27.
The number of unannotated gene sequences in databases is increasing due to sequencing advances. Therefore, computational methods to predict the functions of unannotated genes are needed. Moreover, novel enzyme discovery for metabolic engineering applications further encourages annotation of sequences. Here, enzyme functions are predicted using two general approaches, each including several machine learning algorithms. First, Enzyme-models (E-models) predict Enzyme Commission (EC) numbers from amino acid sequence information. Second, Substrate-Enzyme models (SE-models) are built to predict substrates of enzymatic reactions together with EC numbers, and Substrate-Enzyme-Product models (SEP-models) are built to predict substrates, products, and EC numbers. While the accuracy of E-models is not optimal, SE-models and SEP-models predict EC numbers and reactions with high accuracy using all tested machine learning-based methods. For example, a single Random Forest-based SEP-model predicts EC first digits with an average AUC of over 0.94. Various metrics indicate that the current strategy of combining sequence and chemical structure information is effective at improving enzyme reaction prediction.
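The idea of combining sequence and chemical structure information, and of scoring EC first-digit predictions by an average AUC, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not specify the descriptors or the averaging scheme, so the amino acid composition features, the substrate and product bit vectors, the one-vs-rest macro-average, and all function names (`composition`, `sep_features`, `binary_auc`, `average_auc`) are hypothetical choices made only for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Return the 20-dimensional amino acid frequency vector of a sequence.

    Assumption: a simple composition descriptor stands in for whatever
    sequence features the paper actually uses.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[a] / total for a in AMINO_ACIDS]


def sep_features(sequence, substrate_bits, product_bits):
    """Concatenate sequence features with substrate and product bit vectors.

    Assumption: the bit vectors are placeholder chemical fingerprints.
    """
    return composition(sequence) + list(substrate_bits) + list(product_bits)


def binary_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_auc(true_classes, class_scores):
    """Macro-average of one-vs-rest AUCs over the classes that occur.

    class_scores[i] maps each class (e.g. EC first digit) to the
    predicted score of that class for sample i.
    """
    classes = sorted(set(true_classes))
    aucs = [
        binary_auc([t == c for t in true_classes],
                   [s[c] for s in class_scores])
        for c in classes
    ]
    return sum(aucs) / len(aucs)
```

In such a setup, the vectors returned by `sep_features` would be the input to a classifier such as a random forest, and `average_auc` would summarize its per-class scores for the EC first digit.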