Deep embedding and alignment of protein sequences.

Authors

Llinares-López Felipe, Berthet Quentin, Blondel Mathieu, Teboul Olivier, Vert Jean-Philippe

Affiliation

Brain Team, Google Research, Paris, France.

Publication

Nat Methods. 2023 Jan;20(1):104-111. doi: 10.1038/s41592-022-01700-2. Epub 2022 Dec 15.

DOI: 10.1038/s41592-022-01700-2
PMID: 36522501
Abstract

Protein sequence alignment is a key component of most bioinformatics pipelines to study the structures and functions of proteins. Aligning highly divergent sequences remains, however, a difficult task that current algorithms often fail to perform accurately, leaving many proteins or open reading frames poorly annotated. Here we leverage recent advances in deep learning for language modeling and differentiable programming to propose DEDAL (deep embedding and differentiable alignment), a flexible model to align protein sequences and detect homologs. DEDAL is a machine learning-based model that learns to align sequences by observing large datasets of raw protein sequences and of correct alignments. Once trained, we show that DEDAL improves by up to two- or threefold the alignment correctness over existing methods on remote homologs and better discriminates remote homologs from evolutionarily unrelated sequences, paving the way to improvements on many downstream tasks relying on sequence alignment in structural and functional genomics.
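
The abstract names two ingredients, per-residue embeddings and a differentiable alignment step, without giving algorithmic details. The sketch below is not the authors' implementation. It only illustrates one common way to make a Smith-Waterman-style local alignment score differentiable, by replacing each hard maximum with a log-sum-exp. The dot-product similarity, the linear gap penalty, the `temperature` parameter, and all function names are assumptions chosen for illustration.

```python
import math


def logsumexp(values, temperature):
    """Smooth maximum: approaches max(values) as temperature goes to 0."""
    m = max(values)
    total = sum(math.exp((v - m) / temperature) for v in values)
    return m + temperature * math.log(total)


def similarity_matrix(emb_x, emb_y):
    """Dot-product similarity between every pair of per-residue embeddings.

    Assumption: in a learned model these embeddings would come from a
    trained sequence encoder; here they are plain lists of numbers.
    """
    return [[sum(a * b for a, b in zip(ex, ey)) for ey in emb_y] for ex in emb_x]


def soft_smith_waterman(sim, gap=1.0, temperature=1.0):
    """Smoothed local-alignment score with a linear gap penalty.

    Each cell combines four options (restart, match, gap in x, gap in y)
    with a log-sum-exp instead of a hard max, so the final score is a
    smooth function of the similarity matrix.
    """
    n, m = len(sim), len(sim[0])
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    cells = [0.0]  # the empty alignment scores 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = logsumexp(
                [
                    0.0,                                # restart the alignment
                    h[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with j
                    h[i - 1][j] - gap,                  # gap in the second sequence
                    h[i][j - 1] - gap,                  # gap in the first sequence
                ],
                temperature,
            )
            cells.append(h[i][j])
    # Smooth maximum over all cells, mirroring the "best local score" rule.
    return logsumexp(cells, temperature)


if __name__ == "__main__":
    # Toy 2-D embeddings for sequences of length 3 and 2 (made-up values).
    sim = similarity_matrix([[1, 0], [0, 1], [1, 1]], [[1, 0], [0, 1]])
    print(soft_smith_waterman(sim, gap=1.0, temperature=1e-3))
    print(soft_smith_waterman(sim, gap=1.0, temperature=1.0))
```

With a small `temperature` the score approaches the classic hard-max local alignment score. Larger values give a smoother score that is never lower than the hard one, which is the property that lets gradients reach the embeddings during training.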

Similar articles

1. Deep embedding and alignment of protein sequences.
   Nat Methods. 2023 Jan;20(1):104-111. doi: 10.1038/s41592-022-01700-2. Epub 2022 Dec 15.
2. Protein embedding based alignment.
   BMC Bioinformatics. 2024 Feb 28;25(1):85. doi: 10.1186/s12859-024-05699-5.
3. transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences.
   BMC Bioinformatics. 2005 Jun 22;6:156. doi: 10.1186/1471-2105-6-156.
4. Pairing interacting protein sequences using masked language modeling.
   Proc Natl Acad Sci U S A. 2024 Jul 2;121(27):e2311887121. doi: 10.1073/pnas.2311887121. Epub 2024 Jun 24.
5. End-to-end learning of multiple sequence alignments with differentiable Smith-Waterman.
   Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btac724.
6. Scoring alignments by embedding vector similarity.
   Brief Bioinform. 2024 Mar 27;25(3). doi: 10.1093/bib/bbae178.
7. Effect of tokenization on transformers for biological sequences.
   Bioinformatics. 2024 Mar 29;40(4). doi: 10.1093/bioinformatics/btae196.
8. Application of protein structure alignments to iterated hidden Markov model protocols for structure prediction.
   BMC Bioinformatics. 2006 Sep 14;7:410. doi: 10.1186/1471-2105-7-410.
9. Application of the MAHDS Method for Multiple Alignment of Highly Diverged Amino Acid Sequences.
   Int J Mol Sci. 2022 Mar 29;23(7):3764. doi: 10.3390/ijms23073764.
10. CLUSS: clustering of protein sequences based on a new similarity measure.
    BMC Bioinformatics. 2007 Aug 4;8:286. doi: 10.1186/1471-2105-8-286.

Cited by

1. PLMSearch and PLMAlign: Protein Language Model (PLM)-Based Homologous Protein Sequence Search and Alignment.
   Methods Mol Biol. 2025;2941:227-241. doi: 10.1007/978-1-0716-4623-6_14.
2. Phyloformer: Fast, Accurate, and Versatile Phylogenetic Reconstruction with Deep Neural Networks.
   Mol Biol Evol. 2025 Apr 1;42(4). doi: 10.1093/molbev/msaf051.
3. Detection of circular permutations by Protein Language Models.
   Comput Struct Biotechnol J. 2024 Dec 30;27:214-220. doi: 10.1016/j.csbj.2024.12.029. eCollection 2025.
4. Major advances in protein function assignment by remote homolog detection with protein language models - A review.
   Curr Opin Struct Biol. 2025 Feb;90:102984. doi: 10.1016/j.sbi.2025.102984. Epub 2025 Jan 27.
5. Bilingual language model for protein sequence and structure.
   NAR Genom Bioinform. 2024 Nov 15;6(4):lqae150. doi: 10.1093/nargab/lqae150. eCollection 2024 Dec.
6. High fitness paths can connect proteins with low sequence overlap.
   ArXiv. 2024 Nov 13:arXiv:2411.09054v1.
7. High fitness paths can connect proteins with low sequence overlap.
   bioRxiv. 2024 Nov 15:2024.11.13.623265. doi: 10.1101/2024.11.13.623265.
8. SHARK enables sensitive detection of evolutionary homologs and functional analogs in unalignable and disordered sequences.
   Proc Natl Acad Sci U S A. 2024 Oct 15;121(42):e2401622121. doi: 10.1073/pnas.2401622121. Epub 2024 Oct 9.
9. learnMSA2: deep protein multiple alignments with large language and hidden Markov models.
   Bioinformatics. 2024 Sep 1;40(Suppl 2):ii79-ii86. doi: 10.1093/bioinformatics/btae381.
10. Optimizing protein sequence classification: integrating deep learning models with Bayesian optimization for enhanced biological analysis.
    BMC Med Inform Decis Mak. 2024 Aug 27;24(1):236. doi: 10.1186/s12911-024-02631-y.