Beijing National Laboratory for Molecular Sciences, CAS Key Laboratory of Molecular Recognition and Function, Institute of Chemistry, Chinese Academy of Sciences, Beijing, China.
University of Chinese Academy of Sciences, Beijing, China.
Nat Commun. 2024 Oct 10;15(1):8778. doi: 10.1038/s41467-024-53048-0.
Biocatalysis is an attractive approach for the synthesis of chiral pharmaceuticals and fine chemicals, but assessing and/or improving the enantioselectivity of biocatalysts towards target substrates is often time- and resource-intensive. Although machine learning has been used to reveal the underlying relationship between protein sequences and biocatalytic enantioselectivity, establishing the substrate fitness space is usually disregarded by chemists and remains a challenge. Using 240 datasets collected in our previous works, we adopt chemistry and geometry descriptors and build random forest classification models for predicting the enantioselectivity of amidase towards new substrates. We further propose a heuristic strategy based on these models that allows rational protein engineering to be performed efficiently to synthesize chiral compounds with higher ee values; the optimized variant shows a 53-fold higher E-value than the wild-type amidase. This data-driven methodology is expected to broaden the application of machine learning in biocatalysis research.
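For readers unfamiliar with the two selectivity measures named in the abstract, the following is a minimal sketch of how ee and an E-value are commonly computed. The abstract does not state how the authors obtained their E-values; the sketch assumes the standard kinetic-resolution relation of Chen et al., E = ln[1 - c(1 + ee_p)] / ln[1 - c(1 - ee_p)], where c is the conversion and ee_p the enantiomeric excess of the product. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
import math


def enantiomeric_excess(major: float, minor: float) -> float:
    """Return ee as a fraction (0 to 1) from the amounts of the two enantiomers."""
    total = major + minor
    if total <= 0:
        raise ValueError("total amount of enantiomers must be positive")
    return abs(major - minor) / total


def e_value_from_product(conversion: float, ee_product: float) -> float:
    """Return E from conversion c and product ee (both as fractions).

    Assumes the standard kinetic-resolution relation:
        E = ln[1 - c(1 + ee_p)] / ln[1 - c(1 - ee_p)]
    This is an assumption; the abstract does not say which formula was used.
    """
    if not 0 < conversion < 1:
        raise ValueError("conversion must be strictly between 0 and 1")
    if not 0 <= ee_product < 1:
        raise ValueError("ee_product must be in [0, 1)")
    if conversion * (1 + ee_product) >= 1:
        raise ValueError("c(1 + ee_p) must be below 1 for the formula to be defined")
    numerator = math.log(1 - conversion * (1 + ee_product))
    denominator = math.log(1 - conversion * (1 - ee_product))
    return numerator / denominator


if __name__ == "__main__":
    # Hypothetical measurement: 95:5 product ratio at 40% conversion.
    ee_p = enantiomeric_excess(95.0, 5.0)
    print(f"ee = {ee_p:.2f}, E = {e_value_from_product(0.40, ee_p):.1f}")
```

Under this relation a non-selective enzyme (ee_p = 0) gives E = 1 at any conversion, so a fold change in E, such as the 53-fold improvement reported above, is a ratio of two such values measured for the variant and the wild type.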