ProtMamba: a homology-aware but alignment-free protein state space model.

Authors

Damiano Sgarbossa, Cyril Malbranke, Anne-Florence Bitbol

Affiliations

Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland.

SIB Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland.

Publication information

Bioinformatics. 2025 Jun 2;41(6). doi: 10.1093/bioinformatics/btaf348.

PMID: 40509866
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12206526/

Abstract

MOTIVATION

Protein language models are enabling advances in elucidating the sequence-to-function mapping, and have important applications in protein design. Models based on multiple sequence alignments efficiently capture the evolutionary information in homologous protein sequences, but multiple sequence alignment construction is imperfect.

RESULTS

We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long context, comprising hundreds of protein sequences. It is also computationally efficient. We train ProtMamba on a large dataset of concatenated homologous sequences, using two GPUs. We combine autoregressive modeling and masked language modeling through a fill-in-the-middle training objective. This makes the model adapted to various protein design applications. We demonstrate ProtMamba's usefulness for sequence generation, motif inpainting, fitness prediction, and modeling intrinsically disordered regions. For homolog-conditioned sequence generation, ProtMamba outperforms state-of-the-art models. ProtMamba's competitive performance, despite its relatively small size, sheds light on the importance of long-context conditioning.

AVAILABILITY AND IMPLEMENTATION

A Python implementation of ProtMamba is freely available in our GitHub repository: https://github.com/Bitbol-Lab/ProtMamba-ssm and archived at https://doi.org/10.5281/zenodo.15584634.
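
The fill-in-the-middle objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract states only that ProtMamba is trained on concatenated homologous sequences and combines autoregressive and masked language modeling through fill-in-the-middle. Everything concrete below is an assumption made for illustration: the token names `<cls>`, `<mask-i>` and `<eos>`, the function names, and the choice to transform only the last sequence. It is not the authors' implementation, which lives in the repository linked above.

```python
# Illustrative sketch only: token names and helpers are hypothetical,
# not taken from the ProtMamba code base.
SEP = "<cls>"   # assumed separator placed before each homolog
EOS = "<eos>"   # assumed end-of-sequence token


def concatenate_homologs(seqs):
    """Join unaligned homologous sequences into one long token list."""
    tokens = []
    for seq in seqs:
        tokens.append(SEP)
        tokens.extend(seq)  # one token per amino-acid letter
    return tokens


def fim_transform(seq, spans):
    """Move masked spans to the end of the sequence.

    `seq` is a list of residues; `spans` is a sorted list of
    non-overlapping half-open (start, end) index pairs.
    Each span is replaced in place by a sentinel and appended at the
    end after the same sentinel, so a left-to-right model can predict
    it while seeing the residues on both sides.
    """
    kept, moved, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        if not (cursor <= start < end <= len(seq)):
            raise ValueError("spans must be sorted, non-empty and non-overlapping")
        sentinel = f"<mask-{i}>"
        kept.extend(seq[cursor:start])
        kept.append(sentinel)
        moved.append(sentinel)
        moved.extend(seq[start:end])
        cursor = end
    kept.extend(seq[cursor:])
    return kept + moved + [EOS]


def build_example(homologs, target, spans):
    """Homolog context followed by the transformed target, as next-token pairs."""
    tokens = concatenate_homologs(homologs) + [SEP] + fim_transform(list(target), spans)
    return tokens[:-1], tokens[1:]  # inputs, shifted targets


if __name__ == "__main__":
    inputs, targets = build_example(["MKV", "MRV"], "MKTAYIAKQR", [(2, 5)])
    print(" ".join(inputs))
```

Because the masked residues are appended after the rest of the sequence, a purely autoregressive model can fill in a motif conditioned on both flanks and on the preceding homologs, which is the sense in which the objective combines the two modeling styles.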

Figures (PMC image links)

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fd1/12206526/d0f9b77644ea/btaf348f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fd1/12206526/8de92be37698/btaf348f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fd1/12206526/ecdea76f469f/btaf348f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fd1/12206526/a98b3db42cb9/btaf348f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6fd1/12206526/a223281c5b98/btaf348f5.jpg

Cited by

1
Generative Deep Learning for de Novo Drug Design─A Chemical Space Odyssey.
J Chem Inf Model. 2025 Jul 28;65(14):7352-7372. doi: 10.1021/acs.jcim.5c00641. Epub 2025 Jul 9.
2
Selective State Space Models Outperform Transformers at Predicting RNA-Seq Read Coverage.
bioRxiv. 2025 Feb 17:2025.02.13.638190. doi: 10.1101/2025.02.13.638190.
3
Sequence Modeling Is Not Evolutionary Reasoning.
