

Substrate Prediction for RiPP Biosynthetic Enzymes via Masked Language Modeling and Transfer Learning.

Author Information

Clark Joseph D, Mi Xuenan, Mitchell Douglas A, Shukla Diwakar

Affiliations

School of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Publication Information

ArXiv. 2024 Feb 23:arXiv:2402.15181v1.

Abstract

Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting such peptide fitness landscapes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream classification models of both LazBF and LazDEF substrates. Similarly, masked language modelling of LazDEF substrate preferences produced embeddings that improved the performance of classification models of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. Our transfer learning method improved performance and data efficiency in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
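The masked language modeling objective described above hides a fraction of residues in each peptide substrate and trains the model to recover them, so the learned embeddings encode substrate-level sequence constraints. As a minimal illustration of that masking step (a toy sketch, not the authors' implementation; `mask_peptide`, the mask fraction, and the example sequence are all hypothetical):

```python
# Toy sketch of MLM-style input masking for peptide sequences.
# In practice this is done by a protein language model's tokenizer/collator;
# here we only illustrate how masked positions become prediction targets.
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues
MASK_TOKEN = "<mask>"

def mask_peptide(seq, mask_frac=0.15, rng=None):
    """Return (tokens, labels): roughly mask_frac of residues are replaced
    by MASK_TOKEN and the model is trained to predict them; positions left
    visible get label None and are not scored in the MLM loss."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    tokens, labels = [], []
    for aa in seq:
        if rng.random() < mask_frac:
            tokens.append(MASK_TOKEN)
            labels.append(aa)      # hidden residue the model must recover
        else:
            tokens.append(aa)
            labels.append(None)    # visible residue, excluded from the loss
    return tokens, labels

tokens, labels = mask_peptide("WGSSCSGSC", mask_frac=0.3)
```

After pretraining with this objective on one enzyme's substrate data, the paper's transfer-learning step reuses the resulting embeddings as features for a downstream substrate/non-substrate classifier for the other enzyme in the pathway.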


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9ac7/10925380/447f0fbb5f28/nihpp-2402.15181v1-f0001.jpg
