BetaAlign: a deep learning approach for multiple sequence alignment.

Authors

Dotan Edo, Wygoda Elya, Ecker Noa, Alburquerque Michael, Avram Oren, Belinkov Yonatan, Pupko Tal

Affiliations

The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.

The Henry and Marilyn Taub Faculty of Computer Science, Technion-Israel Institute of Technology, Haifa 3200003, Israel.

Publication information

Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btaf009.

DOI:10.1093/bioinformatics/btaf009
PMID:39775454
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11758787/
Abstract

MOTIVATION

Multiple sequence alignments (MSAs) are extensively used in biology, from phylogenetic reconstruction to structure and function prediction. Here, we suggest an out-of-the-box approach for the inference of MSAs, which relies on algorithms developed for processing natural languages. We show that our artificial intelligence (AI)-based methodology can be trained to align sequences by processing alignments that are generated via simulations, and thus different aligners can be easily generated for datasets with specific evolutionary dynamics attributes. We expect that natural language processing (NLP) solutions will replace or augment classic solutions for computing alignments, and more generally, challenging inference tasks in phylogenomics.
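The idea of training on simulated alignments can be illustrated with a toy generator. The sketch below is not the authors' simulator: the function names, the rates, and the single-residue indel model are all assumptions chosen for brevity. It evolves a random root sequence into one descendant, records the true pairwise alignment, and strips the gaps to obtain the unaligned input that a learned aligner would see.

```python
import random

ALPHABET = "ACGT"


def simulate_pair(root_len=12, sub_rate=0.1, indel_rate=0.05, seed=0):
    """Evolve a random root into one descendant (toy model, hypothetical rates).

    Returns the two rows of the true pairwise alignment.
    """
    rng = random.Random(seed)
    root = [rng.choice(ALPHABET) for _ in range(root_len)]
    aligned_root, aligned_desc = [], []
    for base in root:
        # Insertion in the descendant: the root row receives a gap.
        if rng.random() < indel_rate:
            aligned_root.append("-")
            aligned_desc.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < indel_rate:
            # Deletion in the descendant: the descendant row receives a gap.
            aligned_root.append(base)
            aligned_desc.append("-")
        elif r < indel_rate + sub_rate:
            # Substitution to a different residue.
            aligned_root.append(base)
            aligned_desc.append(rng.choice(ALPHABET.replace(base, "")))
        else:
            # Residue is conserved.
            aligned_root.append(base)
            aligned_desc.append(base)
    return "".join(aligned_root), "".join(aligned_desc)


def strip_gaps(row):
    """Remove gap characters to recover the unaligned sequence."""
    return row.replace("-", "")


def make_training_example(seed):
    """Return (unaligned sequences, true alignment rows) for one simulation."""
    row_a, row_b = simulate_pair(seed=seed)
    return [strip_gaps(row_a), strip_gaps(row_b)], [row_a, row_b]
```

Changing the hypothetical `indel_rate` and `sub_rate` values yields training sets with different evolutionary dynamics, which is the sense in which different aligners could be produced for different kinds of data.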

RESULTS

The MSA problem is a fundamental pillar in bioinformatics, comparative genomics, and phylogenetics. Here, we characterize and improve BetaAlign, the first deep learning aligner, which substantially deviates from conventional algorithms of alignment computation. BetaAlign draws on NLP techniques and trains transformers to map a set of unaligned biological sequences to an MSA. We show that our approach is highly accurate, comparable and sometimes better than state-of-the-art alignment tools. We characterize the performance of BetaAlign and the effect of various aspects on accuracy; for example, the size of the training data, the effect of different transformer architectures, and the effect of learning on a subspace of indel-model parameters (subspace learning). We also introduce a new technique that leads to improved performance compared to our previous approach. Our findings further uncover the potential of NLP-based methods for sequence alignment, highlighting that AI-based algorithms can substantially challenge classic approaches in phylogenomics and bioinformatics.
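To make the "unaligned sequences in, MSA out" framing concrete, the sketch below shows one possible way to turn an alignment task into a pair of token sentences for a sequence-to-sequence model, plus a column score for comparing a predicted alignment with a reference. The abstract does not specify the encoding or the accuracy measure, so the separator token, the column-wise target format, and all function names here are assumptions made for illustration only.

```python
def encode_source(seqs, sep="|"):
    """Unaligned sequences -> one sentence of residue tokens with separators."""
    tokens = []
    for i, seq in enumerate(seqs):
        if i:
            tokens.append(sep)
        tokens.extend(seq)
    return " ".join(tokens)


def encode_target(rows):
    """Alignment rows -> one token per column, read left to right."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    return " ".join("".join(col) for col in zip(*rows))


def decode_target(sentence):
    """Inverse of encode_target: column tokens -> alignment rows."""
    return ["".join(chars) for chars in zip(*sentence.split())]


def columns(rows):
    """Describe each column by the residue index it holds in every row."""
    counters = [0] * len(rows)
    result = []
    for col in zip(*rows):
        key = []
        for i, ch in enumerate(col):
            if ch == "-":
                key.append(None)  # gap: no residue from this row
            else:
                key.append(counters[i])
                counters[i] += 1
        result.append(tuple(key))
    return result


def column_score(pred_rows, true_rows):
    """Fraction of reference columns reproduced exactly in the prediction."""
    true_cols = columns(true_rows)
    pred_cols = set(columns(pred_rows))
    return sum(c in pred_cols for c in true_cols) / len(true_cols)


reference = ["AC-GT", "ACAGT"]
predicted = ["ACG-T", "ACAGT"]
print(encode_source(["ACGT", "ACAGT"]))    # A C G T | A C A G T
print(encode_target(reference))            # AA CC -A GG TT
print(column_score(predicted, reference))  # 0.6
```

Identifying columns by residue indices rather than by characters means that two columns containing the same letters at different positions are not counted as a match.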

AVAILABILITY AND IMPLEMENTATION

Datasets used in this work are available on HuggingFace (Wolf et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. p.38-45. 2020) at: https://huggingface.co/dotan1111. Source code is available at: https://github.com/idotan286/SimulateAlignments.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acf4/11758787/a3d6a44b1c02/btaf009f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acf4/11758787/e0efead65229/btaf009f2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acf4/11758787/cbc2a6443ddf/btaf009f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acf4/11758787/3f2f718227c3/btaf009f4.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acf4/11758787/c0556b702399/btaf009f5.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acf4/11758787/beecf33b4556/btaf009f6.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/acf4/11758787/ef390e8635db/btaf009f7.jpg
