Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky 40506, United States.
National Bioenergy Center, National Renewable Energy Laboratory, Golden, Colorado 80401, United States.
J Chem Inf Model. 2020 Aug 24;60(8):4098-4107. doi: 10.1021/acs.jcim.0c00489. Epub 2020 Jul 22.
Accurate prediction of the optimal catalytic temperature (Topt) of enzymes is vital in biotechnology, as enzymes with high Topt values are desired for enhanced reaction rates. Recently, a machine learning method (temperature optima for microorganisms and enzymes, TOME) for predicting Topt was developed. TOME was trained on a normally distributed data set with a median Topt of 37 °C and less than 5% of Topt values above 85 °C, limiting the method's predictive capabilities for thermostable enzymes. Due to the distribution of the training data, the mean squared error on Topt values greater than 85 °C is nearly an order of magnitude higher than the error on values between 30 and 50 °C. In this study, we apply ensemble learning and resampling strategies that tackle the data imbalance to significantly decrease the error on high Topt values (>85 °C) by 60% and increase the overall R² value from 0.527 to 0.632. The revised method, temperature optima for enzymes with resampling (TOMER), and the resampling strategies applied in this work are freely available to other researchers as Python packages on GitHub.
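The general idea of resampling an imbalanced regression data set can be illustrated with a minimal sketch. This is not the TOMER implementation: it shows plain random oversampling, one of the simplest resampling strategies, and the function names `oversample_rare` and `mse_in_range`, the toy data, and the choice to balance the two groups exactly are all assumptions made for illustration. Only the 85 °C threshold and the 30 to 50 °C range come from the abstract.

```python
import random


def oversample_rare(samples, threshold=85.0, seed=0):
    """Randomly duplicate samples whose target exceeds `threshold` until
    they are as numerous as the remaining samples.

    `samples` is a list of (features, target) pairs. Hypothetical helper,
    not part of TOME or TOMER.
    """
    rng = random.Random(seed)
    rare = [s for s in samples if s[1] > threshold]
    common = [s for s in samples if s[1] <= threshold]
    if not rare or len(rare) >= len(common):
        return list(samples)  # nothing to balance
    extra = [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + rare + extra


def mse_in_range(y_true, y_pred, low, high):
    """Mean squared error restricted to targets with low <= y_true <= high."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if low <= t <= high]
    if not pairs:
        return float("nan")
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


# Toy data (assumed): 19 mesophilic-like samples and 1 thermostable-like sample.
data = [((0.1,), 37.0)] * 19 + [((0.9,), 90.0)]
balanced = oversample_rare(data, threshold=85.0)

# Range-specific errors let the high-temperature tail be assessed separately.
y_true = [90.0, 40.0]
y_pred = [80.0, 41.0]
high_mse = mse_in_range(y_true, y_pred, 85.0, 200.0)
mid_mse = mse_in_range(y_true, y_pred, 30.0, 50.0)
```

A model trained on `balanced` sees the rare high-Topt samples as often as the common ones, and reporting the error separately per temperature range, as the abstract does, shows whether the tail actually improved.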