Clark Joseph D, Mi Xuenan, Mitchell Douglas A, Shukla Diwakar
School of Molecular and Cellular Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Digit Discov. 2024 Dec 2;4(2):343-354. doi: 10.1039/d4dd00170b. eCollection 2025 Feb 12.
Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved prediction of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. We found that a single high-quality data set of substrates and non-substrates for a RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.
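As a rough illustration of the masked language modeling objective mentioned in the abstract, the sketch below corrupts a peptide sequence by hiding a random subset of residues and recording the originals as reconstruction targets. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function name `mask_peptide`, the `<mask>` token, the 15% masking rate (a common BERT-style default), and the example peptide string are all hypothetical, and the abstract does not specify the tokenizer or masking scheme actually used.

```python
import random

MASK = "<mask>"  # hypothetical mask token; the real tokenizer is not described in the abstract


def mask_peptide(seq, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for one masked-language-modeling example.

    labels[i] holds the original residue at masked positions and None elsewhere,
    so a training loss would only be computed where a residue must be reconstructed.
    """
    if not seq:
        raise ValueError("peptide sequence must be non-empty")
    rng = rng or random.Random()
    tokens = list(seq)  # one token per residue (an assumption)
    labels = [None] * len(tokens)
    for i, residue in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = residue
            tokens[i] = MASK
    # Guarantee at least one masked position so every example carries a training signal.
    if all(label is None for label in labels):
        i = rng.randrange(len(tokens))
        labels[i] = tokens[i]
        tokens[i] = MASK
    return tokens, labels


# Illustrative peptide only; it is not a sequence taken from the study.
tokens, labels = mask_peptide("ACDEFGHIKL", mask_prob=0.15, rng=random.Random(0))
print(tokens)
print(labels)
```

In a setup like the one the abstract describes, a model trained to fill in such masked residues would then supply per-sequence embeddings to a separate substrate versus non-substrate classifier; that downstream step is not shown here.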