Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois 60637, United States.
Department of Chemistry, University of Chicago, Chicago, Illinois 60637, United States.
ACS Synth Biol. 2023 Dec 15;12(12):3544-3561. doi: 10.1021/acssynbio.3c00261. Epub 2023 Nov 21.
Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach that typically lacks a low-dimensional latent embedding but does not require training sequences to be aligned into an MSA and enables the design of variable length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms in enabling training over unaligned sequence data and the conditional generative design of variable length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model's ability to infer patterns and design rules within alignment-free homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker's yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues.
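The building block behind a WaveNet-style autoregressive decoder is the causal dilated convolution, which can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the ProtWave-VAE implementation: the function names `causal_dilated_conv` and `receptive_field`, the kernel size, and the dilation schedule are all assumptions chosen for the example, and the abstract does not state the actual layer configuration.

```python
def causal_dilated_conv(x, weights, dilation):
    """1D causal dilated convolution over a list of floats.

    out[t] = sum_j weights[j] * x[t - j * dilation], so position t only
    sees the current and earlier positions. Indices before the start of
    the sequence are treated as zero padding.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input positions one output position can see after
    stacking one causal layer per entry in `dilations`."""
    return 1 + (kernel_size - 1) * sum(dilations)


# Assumed example values: kernel size 2 and doubling dilations.
signal = [1.0, 2.0, 3.0, 4.0]
layer_out = causal_dilated_conv(signal, weights=[1.0, 1.0], dilation=2)
span = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

Because each output depends only on the current and earlier inputs, stacking such layers with growing dilations lets a decoder condition on a long sequence prefix with few layers. In an autoregressive setting the input is typically shifted by one position so that residue t is predicted from residues before t. This per-position formulation is also what allows sequences of different lengths to be handled without an alignment.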