Machine learning reveals sequence-function relationships in family 7 glycoside hydrolases.

Affiliations

Department of Chemical and Materials Engineering, University of Kentucky, Lexington, Kentucky, USA; Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, Colorado, USA.

Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA.

Publication information

J Biol Chem. 2021 Aug;297(2):100931. doi: 10.1016/j.jbc.2021.100931. Epub 2021 Jul 1.

Abstract

Family 7 glycoside hydrolases (GH7) are among the principal enzymes for cellulose degradation in nature and industrially. These enzymes are often bimodular, including a catalytic domain and carbohydrate-binding module (CBM) attached via a flexible linker, and exhibit an active site that binds cello-oligomers of up to ten glucosyl moieties. GH7 cellulases consist of two major subtypes: cellobiohydrolases (CBH) and endoglucanases (EG). Despite the critical importance of GH7 enzymes, there remain gaps in our understanding of how GH7 sequence and structure relate to function. Here, we employed machine learning to gain data-driven insights into relationships between sequence, structure, and function across the GH7 family. Machine-learning models, trained only on the number of residues in the active-site loops as features, were able to discriminate GH7 CBHs and EGs with up to 99% accuracy, demonstrating that the lengths of loops A4, B2, B3, and B4 strongly correlate with functional subtype across the GH7 family. Classification rules were derived such that specific residues at 42 different sequence positions each predicted the functional subtype with accuracies surpassing 87%. A random forest model trained on residues at 19 positions in the catalytic domain predicted the presence of a CBM with 89.5% accuracy. Our machine learning results recapitulate, as top-performing features, a substantial number of the sequence positions determined by previous experimental studies to play vital roles in GH7 activity. We surmise that the yet-to-be-explored sequence positions among the top-performing features also contribute to GH7 functional variation and may be exploited to understand and manipulate function.
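To make the loop-length result concrete, the following is a minimal sketch of a one-feature threshold classifier (a decision stump) that separates CBHs from EGs using the residue count of a single loop. It is not the authors' model or pipeline. The loop lengths in `TOY_DATA` are invented for illustration and are not taken from the paper, the function names are hypothetical, and the direction of the rule (longer loop implies CBH) is an assumption, since the abstract states only that loop length correlates with subtype.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: loop lengths (residue counts) per sequence.
# These numbers are invented for illustration and are NOT from the paper.
LOOPS = ("A4", "B2", "B3", "B4")
Sample = Tuple[Dict[str, int], str]
TOY_DATA: List[Sample] = [
    ({"A4": 12, "B2": 14, "B3": 13, "B4": 10}, "CBH"),
    ({"A4": 11, "B2": 15, "B3": 12, "B4": 9}, "CBH"),
    ({"A4": 12, "B2": 13, "B3": 13, "B4": 10}, "CBH"),
    ({"A4": 5, "B2": 6, "B3": 4, "B4": 3}, "EG"),
    ({"A4": 4, "B2": 7, "B3": 5, "B4": 3}, "EG"),
    ({"A4": 6, "B2": 6, "B3": 12, "B4": 4}, "EG"),
]


def stump_accuracy(data: List[Sample], loop: str, threshold: int) -> float:
    """Accuracy of the assumed rule: length >= threshold -> CBH, else EG."""
    correct = sum(
        ("CBH" if feats[loop] >= threshold else "EG") == label
        for feats, label in data
    )
    return correct / len(data)


def best_stump(data: List[Sample], loop: str) -> Tuple[int, float]:
    """Try each observed length as a threshold; return (threshold, accuracy).

    Ties are broken in favor of the smallest threshold.
    """
    candidates = sorted({feats[loop] for feats, _ in data})
    return max(
        ((t, stump_accuracy(data, loop, t)) for t in candidates),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    for loop in LOOPS:
        threshold, acc = best_stump(TOY_DATA, loop)
        print(f"{loop}: length >= {threshold} -> CBH, accuracy {acc:.2f}")
```

A real analysis would evaluate such rules on held-out sequences rather than on the data used to choose the threshold. The same pattern extends to the position-specific rules described in the abstract by replacing the length threshold with a test for a specific residue at an alignment position.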

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ff91/8329511/e7e59ed2e0d8/gr1.jpg
