ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design.

Affiliations

Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois 60637, United States.

Department of Chemistry, University of Chicago, Chicago, Illinois 60637, United States.

Publication information

ACS Synth Biol. 2023 Dec 15;12(12):3544-3561. doi: 10.1021/acssynbio.3c00261. Epub 2023 Nov 21.

DOI:10.1021/acssynbio.3c00261

PMID:37988083

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10911954/

Abstract

Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach that typically lacks a low-dimensional latent embedding but does not require training sequences to be aligned into an MSA and enables the design of variable-length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information-maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms in enabling training over unaligned sequence data and the conditional generative design of variable-length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model's ability to infer patterns and design rules within unaligned homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker's yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues.
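The abstract names dilated convolutions and a WaveNet-style autoregressive decoder without giving details. As a rough illustration of the general idea only (not the authors' implementation), the sketch below stacks causal dilated 1-D convolutions in plain Python. The kernel size of 2 and the doubling dilations (1, 2, 4, 8) are assumptions borrowed from the original WaveNet design, and the function names are hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output position
    after stacking one causal convolution per dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """Causal 1-D convolution: y[t] = sum_k weights[k] * x[t - k * dilation].

    Positions before the start of the sequence are treated as zero, so
    y[t] never depends on x[t + 1], x[t + 2], ...
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def conv_stack(x, weights, dilations):
    """Apply one causal dilated convolution per dilation, in order.
    The same illustrative weights are reused in every layer."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x


# Assumed, illustrative hyperparameters (not taken from the paper).
DILATIONS = [1, 2, 4, 8]
WEIGHTS = [1.0, 1.0]

# An impulse at position 0 spreads over exactly `receptive_field` positions.
impulse = [1.0] + [0.0] * 19
response = conv_stack(impulse, WEIGHTS, DILATIONS)
```

With these assumed settings, four layers already cover 16 sequence positions, which is why dilation is a common way to reach long-range context with few parameters. Causality is what permits left-to-right sampling: the output at position t can be computed before any later residue exists.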
