SESNet: sequence-structure feature-integrated deep learning method for data-efficient protein engineering.

Author information

Li Mingchen, Kang Liqi, Xiong Yi, Wang Yu Guang, Fan Guisheng, Tan Pan, Hong Liang

Affiliations

Shanghai National Center for Applied Mathematics (SJTU Center), & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China.

School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200240, China.

Publication information

J Cheminform. 2023 Feb 3;15(1):12. doi: 10.1186/s13321-023-00688-x.

DOI: 10.1186/s13321-023-00688-x
PMID: 36737798
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9898993/

Abstract

Deep learning has been widely used for protein engineering. However, it is limited by the lack of sufficient experimental data to train an accurate model for predicting the functional fitness of high-order mutants. Here, we develop SESNet, a supervised deep-learning model to predict the fitness for protein mutants by leveraging both sequence and structure information, and exploiting attention mechanism. Our model integrates local evolutionary context from homologous sequences, the global evolutionary context encoding rich semantic from the universal protein sequence space and the structure information accounting for the microenvironment around each residue in a protein. We show that SESNet outperforms state-of-the-art models for predicting the sequence-function relationship on 26 deep mutational scanning datasets. More importantly, we propose a data augmentation strategy by leveraging the data from unsupervised models to pre-train our model. After that, our model can achieve strikingly high accuracy in prediction of the fitness of protein mutants, especially for the higher order variants (> 4 mutation sites), when finetuned by using only a small number of experimental mutation data (< 50). The strategy proposed is of great practical value as the required experimental effort, i.e., producing a few tens of experimental mutation data on a given protein, is generally affordable by an ordinary biochemical group and can be applied on almost any protein.
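
The two-stage strategy described in the abstract (pre-train on data produced by an unsupervised model, then fine-tune on fewer than 50 experimental measurements) can be illustrated with a toy sketch. This is not the SESNet architecture: it assumes a plain linear model over per-site mutation indicators, a synthetic ground truth (`experimental_fitness`), and a synthetic proxy (`unsupervised_score`) that is correlated with fitness but wrongly scaled and offset. All names, sizes, and hyperparameters here are hypothetical and chosen only for illustration.

```python
import random

def predict(weights, bias, x):
    """Linear fitness prediction for one variant (x is a 0/1 mutation vector)."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def mse(weights, bias, X, y):
    """Mean squared error of the model on a labelled set."""
    return sum((predict(weights, bias, x) - t) ** 2 for x, t in zip(X, y)) / len(X)

def train(weights, bias, X, y, lr, steps):
    """Full-batch gradient descent on MSE, starting from the given parameters."""
    weights = list(weights)
    n = len(X)
    for _ in range(steps):
        errors = [predict(weights, bias, x) - t for x, t in zip(X, y)]
        grad_w = [2.0 / n * sum(e * x[j] for e, x in zip(errors, X))
                  for j in range(len(weights))]
        grad_b = 2.0 / n * sum(errors)
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

rng = random.Random(0)
N_SITES = 8  # hypothetical number of mutable positions
true_effects = [rng.uniform(-1.0, 1.0) for _ in range(N_SITES)]

def sample_variants(n):
    """Random variants; each site is mutated (1) or wild type (0)."""
    return [[rng.randint(0, 1) for _ in range(N_SITES)] for _ in range(n)]

def experimental_fitness(x):
    """Assumed ground truth: additive site effects (a stand-in for wet-lab data)."""
    return sum(w * xi for w, xi in zip(true_effects, x))

def unsupervised_score(x):
    """Assumed proxy from an unsupervised model: correlated but mis-calibrated."""
    return 0.5 * experimental_fitness(x) + 1.0

# Stage 1: pre-train on many cheap pseudo-labels from the proxy.
pretrain_X = sample_variants(500)
pretrain_y = [unsupervised_score(x) for x in pretrain_X]
w_pre, b_pre = train([0.0] * N_SITES, 0.0, pretrain_X, pretrain_y, lr=0.05, steps=300)

# Stage 2: fine-tune on a small experimental set (40 variants, i.e. < 50).
finetune_X = sample_variants(40)
finetune_y = [experimental_fitness(x) for x in finetune_X]
w_fit, b_fit = train(w_pre, b_pre, finetune_X, finetune_y, lr=0.05, steps=2000)

# Compare both models on held-out variants scored with the assumed ground truth.
test_X = sample_variants(200)
test_y = [experimental_fitness(x) for x in test_X]
mse_pretrained = mse(w_pre, b_pre, test_X, test_y)
mse_finetuned = mse(w_fit, b_fit, test_X, test_y)
print("held-out MSE, pre-trained only:", mse_pretrained)
print("held-out MSE, after fine-tuning:", mse_finetuned)
```

In this toy setting the pre-trained model inherits the proxy's miscalibration, and the small experimental set corrects it. The sketch shows only the workflow; it makes no claim about the accuracy gains reported for SESNet.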

Figures

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3914/9898993/75d4da67ef50/13321_2023_688_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3914/9898993/9763027c4dc2/13321_2023_688_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3914/9898993/d381f71fe7d8/13321_2023_688_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3914/9898993/78be68522687/13321_2023_688_Fig4_HTML.jpg
Figure 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3914/9898993/931844b327c6/13321_2023_688_Fig5_HTML.jpg