Protein design and variant prediction using autoregressive generative models.

Affiliations

Department of Systems Biology, Harvard Medical School, Boston, MA, USA.

insitro, South San Francisco, CA, USA.

Publication information

Nat Commun. 2021 Apr 23;12(1):2403. doi: 10.1038/s41467-021-22732-w.

DOI:10.1038/s41467-021-22732-w

PMID:33893299

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8065141/

Abstract

The ability to design functional sequences and predict effects of variation is central to protein engineering and biotherapeutics. State-of-art computational methods rely on models that leverage evolutionary information but are inadequate for important applications where multiple sequence alignments are not robust. Such applications include the prediction of variant effects of indels, disordered proteins, and the design of proteins such as antibodies due to the highly variable complementarity determining regions. We introduce a deep generative model adapted from natural language processing for prediction and design of diverse functional sequences without the need for alignments. The model performs state-of-art prediction of missense and indel effects and we successfully design and test a diverse 10-nanobody library that shows better expression than a 1000-fold larger synthetic library. Our results demonstrate the power of the alignment-free autoregressive model in generalizing to regions of sequence space traditionally considered beyond the reach of prediction and design.
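As a rough illustration of the alignment-free idea described in the abstract, the sketch below factorizes a sequence probability autoregressively, p(x) = Π p(x_i | x_<i), and scores a variant by the difference in log-likelihood from the wild type. It is not the authors' model: a toy bigram model with additive smoothing stands in for the deep network, the training sequences are made up, and every name (`train_bigram`, `log_likelihood`, `variant_score`, `sample`) is hypothetical. Scoring by log-likelihood difference is an assumption made here for illustration. Because each sequence is scored on its own, with no alignment columns, substitutions and deletions are handled by the same function.

```python
import math
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
START, END = "^", "$"          # hypothetical start/end-of-sequence tokens
VOCAB = AMINO_ACIDS + END      # tokens the model can emit


def train_bigram(sequences, pseudocount=1.0):
    """Fit p(next | previous) from unaligned sequences with additive smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        tokens = START + seq + END
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1.0
    model = {}
    for prev in AMINO_ACIDS + START:
        total = sum(counts[prev].values()) + pseudocount * len(VOCAB)
        model[prev] = {t: (counts[prev][t] + pseudocount) / total for t in VOCAB}
    return model


def log_likelihood(model, seq):
    """Sum of log p(x_i | x_{i-1}) over the sequence, including the end token."""
    tokens = START + seq + END
    return sum(math.log(model[p][n]) for p, n in zip(tokens, tokens[1:]))


def variant_score(model, wild_type, variant):
    """Log-likelihood difference; works for any length, so indels need no alignment."""
    return log_likelihood(model, variant) - log_likelihood(model, wild_type)


def sample(model, rng, max_len=50):
    """Generate a new sequence one residue at a time (toy 'design' step)."""
    seq, prev = [], START
    for _ in range(max_len):
        weights = [model[prev][t] for t in VOCAB]
        nxt = rng.choices(list(VOCAB), weights=weights)[0]
        if nxt == END:
            break
        seq.append(nxt)
        prev = nxt
    return "".join(seq)


# Made-up toy "family"; real models are trained on large sets of natural sequences.
family = ["MKTAYIAK", "MKTAYLAK", "MKSAYIAK", "MKTAYIAR"]
toy_model = train_bigram(family)
wild_type = "MKTAYIAK"
print("missense A4W:", variant_score(toy_model, wild_type, "MKTWYIAK"))
print("deletion of A4:", variant_score(toy_model, wild_type, "MKTYIAK"))
print("sampled design:", sample(toy_model, random.Random(0)))
```

Note that a raw likelihood difference tends to favor shorter sequences, since they contribute fewer terms to the sum; any practical use for indels would need to account for that, which this sketch does not attempt.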


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/56f2/8065141/875eec8e7fdf/41467_2021_22732_Fig1_HTML.jpg
