Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA.
Department of Bioengineering, University of Texas at Dallas, Richardson, TX, 75080, USA.
Nat Commun. 2023 Apr 19;14(1):2222. doi: 10.1038/s41467-023-37958-z.
Variational autoencoders are unsupervised learning models with generative capabilities. When applied to protein data, they classify sequences by phylogeny and generate de novo sequences that preserve the statistical properties of protein composition. Whereas previous studies focused on clustering and generative features, here we evaluate the underlying latent manifold in which sequence information is embedded. To investigate the properties of this manifold, we use direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We show how this landscape captures phylogenetic groupings and functional and fitness properties of several systems, including globins, β-lactamases, ion channels, and transcription factors. We also provide evidence that the landscape helps explain the effects of sequence variability observed in experimental data and offers insights into directed and natural protein evolution. We propose that combining the generative properties and functional predictive power of variational autoencoders with coevolutionary analysis could benefit applications in protein engineering and design.
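To make the idea of a latent generative landscape concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a two-dimensional latent space, a Potts model with fields `h` and couplings `J` (random placeholders here, whereas in practice they would be inferred by direct coupling analysis), and a hypothetical `toy_decoder` standing in for a trained variational autoencoder decoder. Each grid point in latent space is decoded to its most probable sequence, which is then scored with the Potts Hamiltonian H(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j).

```python
import numpy as np


def potts_hamiltonian(seq, h, J):
    """Potts energy H(s) = -sum_i h[i, s_i] - sum_{i<j} J[i, j, s_i, s_j].

    seq: integer array of length L (residue indices in [0, q)).
    h:   fields, shape (L, q).
    J:   couplings, shape (L, L, q, q).
    """
    L = len(seq)
    idx = np.arange(L)
    field_term = h[idx, seq].sum()
    # pair[i, j] = J[i, j, seq[i], seq[j]] via broadcasted advanced indexing
    pair = J[idx[:, None], idx[None, :], seq[:, None], seq[None, :]]
    coupling_term = np.triu(pair, k=1).sum()  # keep only pairs with i < j
    return -(field_term + coupling_term)


def toy_decoder(z, W, b):
    """Hypothetical stand-in for a trained VAE decoder.

    Maps a latent point z of shape (d,) to per-position residue
    probabilities of shape (L, q) through a linear map and a softmax.
    """
    logits = np.tensordot(z, W, axes=1) + b
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def latent_landscape(grid_axis, W, b, h, J):
    """Potts energy of the most probable decoded sequence on a 2D latent grid."""
    n = len(grid_axis)
    energy = np.empty((n, n))
    for a, z0 in enumerate(grid_axis):
        for c, z1 in enumerate(grid_axis):
            probs = toy_decoder(np.array([z0, z1]), W, b)
            seq = probs.argmax(axis=1)  # assumption: decode by arg-max
            energy[a, c] = potts_hamiltonian(seq, h, J)
    return energy


# Placeholder parameters (random, for illustration only).
rng = np.random.default_rng(0)
L, q, d = 8, 21, 2  # sequence length, alphabet size, latent dimension
h = rng.normal(size=(L, q))
J = rng.normal(scale=0.1, size=(L, L, q, q))
W = rng.normal(size=(d, L, q))
b = rng.normal(size=(L, q))

landscape = latent_landscape(np.linspace(-3, 3, 5), W, b, h, J)
print(landscape.shape)
```

With real parameters, low-energy regions of such a grid would mark latent coordinates that decode to sequences the coevolutionary model considers favorable; the grid range, resolution, and arg-max decoding are illustrative choices.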