School of Biological Sciences, The University of Edinburgh, Edinburgh, United Kingdom.
School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom.
Nat Commun. 2024 Oct 29;15(1):9309. doi: 10.1038/s41467-024-53622-6.
Engineering proteins with desired functions and biochemical properties is pivotal for biotechnology and drug discovery. Computational methods based on evolutionary information reduce the experimental burden by designing targeted libraries of functional variants, but they still have a low success rate when the desired protein has few or only very remote homologous sequences. Here we propose an autoregressive model, the Temporal Dirichlet Variational Autoencoder (TDVAE), which exploits the mathematical properties of the Dirichlet distribution and temporal convolution to efficiently learn high-order information from a set of functionally related, possibly only remotely similar, sequences. TDVAE is highly accurate in predicting the effects of amino acid mutations while being 90% smaller than other state-of-the-art models. We then use TDVAE to design variants of the human alpha-galactosidase enzyme as potential treatments for Fabry disease. Our model builds a library of diverse variants that retain the sequence, biochemical and structural properties of the wild-type protein, suggesting they could be suitable for enzyme replacement therapy. Taken together, our results show the importance of accurate sequence modelling and the potential of autoregressive models as protein engineering and analysis tools.