Dotan Edo, Wygoda Elya, Ecker Noa, Alburquerque Michael, Avram Oren, Belinkov Yonatan, Pupko Tal
The Shmunis School of Biomedicine and Cancer Research, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel.
The Henry and Marilyn Taub Faculty of Computer Science, Technion-Israel Institute of Technology, Haifa 3200003, Israel.
Bioinformatics. 2024 Dec 26;41(1). doi: 10.1093/bioinformatics/btaf009.
Multiple sequence alignments (MSAs) are extensively used in biology, from phylogenetic reconstruction to structure and function prediction. Here, we suggest an out-of-the-box approach for the inference of MSAs, which relies on algorithms developed for processing natural languages. We show that our artificial intelligence (AI)-based methodology can be trained to align sequences by processing alignments that are generated via simulations, and thus different aligners can be easily generated for datasets with specific evolutionary dynamics attributes. We expect that natural language processing (NLP) solutions will replace or augment classic solutions for computing alignments and, more generally, for challenging inference tasks in phylogenomics.
The MSA problem is a fundamental pillar in bioinformatics, comparative genomics, and phylogenetics. Here, we characterize and improve BetaAlign, the first deep learning aligner, which substantially deviates from conventional algorithms of alignment computation. BetaAlign draws on NLP techniques and trains transformers to map a set of unaligned biological sequences to an MSA. We show that our approach is highly accurate, comparable to, and sometimes better than, state-of-the-art alignment tools. We characterize the performance of BetaAlign and the effect of various factors on accuracy, for example, the size of the training data, the choice of transformer architecture, and learning on a subspace of indel-model parameters (subspace learning). We also introduce a new technique that leads to improved performance compared to our previous approach. Our findings further uncover the potential of NLP-based methods for sequence alignment, highlighting that AI-based algorithms can substantially challenge classic approaches in phylogenomics and bioinformatics.
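To make the idea of mapping unaligned sequences to an MSA with a sequence-to-sequence model more concrete, the following is a minimal sketch of how such a task could be phrased as "translation" between two token sentences. The encoding shown here (sequences joined by a separator token on the source side, the alignment read column by column on the target side) is an assumption chosen for illustration and is not claimed to be the exact representation used by BetaAlign; the separator symbol and all function names are hypothetical.

```python
from typing import List

SEP = "|"  # hypothetical separator token between input sequences
GAP = "-"  # gap character in the alignment


def encode_source(unaligned: List[str]) -> str:
    """Join unaligned sequences into one space-separated token sentence."""
    tokens: List[str] = []
    for i, seq in enumerate(unaligned):
        if i > 0:
            tokens.append(SEP)
        tokens.extend(seq)  # one token per character
    return " ".join(tokens)


def encode_target(msa: List[str]) -> str:
    """Serialize an alignment column by column (assumed target encoding)."""
    if len({len(row) for row in msa}) != 1:
        raise ValueError("all alignment rows must have the same length")
    tokens = [row[col] for col in range(len(msa[0])) for row in msa]
    return " ".join(tokens)


def decode_target(sentence: str, n_seqs: int) -> List[str]:
    """Invert encode_target: rebuild alignment rows from a column-wise sentence."""
    tokens = sentence.split()
    if len(tokens) % n_seqs != 0:
        raise ValueError("token count is not a multiple of the number of sequences")
    # Row i consists of every n_seqs-th token, starting at offset i.
    return ["".join(tokens[i::n_seqs]) for i in range(n_seqs)]


def is_valid_alignment(unaligned: List[str], msa: List[str]) -> bool:
    """Check that removing gaps from each row recovers the input sequences."""
    return [row.replace(GAP, "") for row in msa] == list(unaligned)


if __name__ == "__main__":
    unaligned = ["ACGT", "ACGGT", "AGT"]
    true_msa = ["AC-GT", "ACGGT", "A--GT"]  # e.g. produced by a simulator

    src = encode_source(unaligned)
    tgt = encode_target(true_msa)
    print(src)
    print(tgt)

    # A trained model would predict `tgt` from `src`; here we only round-trip.
    recovered = decode_target(tgt, n_seqs=len(unaligned))
    print(recovered, is_valid_alignment(unaligned, recovered))
```

In a setup like this, pairs of source and target sentences derived from simulated alignments would serve as training examples. The validity check is included because a generative model can, in principle, emit an output that does not correspond to the input sequences once gaps are removed.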
Datasets used in this work are available on HuggingFace (Wolf et al. Transformers: state-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. p. 38-45. 2020) at: https://huggingface.co/dotan1111. Source code is available at: https://github.com/idotan286/SimulateAlignments.