

FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets.

Author information

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China.

Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China.

Publication information

Bioinformatics. 2024 Jan 2;40(1). doi: 10.1093/bioinformatics/btae014.

Abstract

MOTIVATION

In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly.
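The vertical division idea can be illustrated with a minimal Python sketch. It is not FMAlign's code: it assumes the common seeds have already been located (FMAlign does this with an FM-index) and that they appear in the same order in every sequence. The function name `split_at_seeds` and its parameters are hypothetical.

```python
def split_at_seeds(seqs, seed_starts, seed_len):
    """Cut every sequence at shared seeds (vertical division).

    seqs        -- list of sequences (strings)
    seed_starts -- one tuple per seed, giving the seed's start position
                   in each sequence; seeds are assumed to be sorted and
                   to occur in the same order in all sequences
    seed_len    -- length of every seed (assumed equal, for simplicity)

    Returns a list of blocks. Each block holds one segment per sequence,
    so every block is a smaller, independent alignment problem.
    """
    blocks = []
    prev = [0] * len(seqs)
    for starts in seed_starts:
        # Segment between the previous seed and this one: needs real alignment.
        blocks.append([s[p:q] for s, p, q in zip(seqs, prev, starts)])
        # The seed itself: identical in all sequences, already aligned.
        blocks.append([s[q:q + seed_len] for s, q in zip(seqs, starts)])
        prev = [q + seed_len for q in starts]
    # Tail after the last seed.
    blocks.append([s[p:] for s, p in zip(seqs, prev)])
    return blocks


blocks = split_at_seeds(["ACGTTTGCA", "CGTTTGGCA"], [(2, 1)], 4)
for block in blocks:
    print(block)
```

Concatenating the blocks column by column restores each input sequence, which is why the segments between seeds can be aligned independently and joined afterwards.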

RESULTS

FMAlign2 uses a suffix array to identify maximal exact matches (MEMs), changing FMAlign's approach from searching for global chains to searching for partial chains. With a vertical division strategy, a large-scale problem is decomposed into manageable tasks, which allows the sub-MSAs to be executed in parallel. Sequence-profile alignment and refinement are then used to concatenate the subsets into the final result. Compared with FMAlign, FMAlign2 segments the sequences markedly more extensively and significantly reduces running time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enables existing MSA methods to handle sequences reaching billions in length within an acceptable time frame.
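To make the notion of a maximal exact match concrete, here is a small Python sketch for two sequences. It is an illustration under simplifying assumptions, not FMAlign2's implementation: the suffix array is built by naive sorting, only the longest match starting at each query position is considered (so it reports a subset of all MEMs), and the names `find_mems` and `min_len` are hypothetical.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; fine for a short demo only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, i, b, j):
    # Length of the longest common prefix of a[i:] and b[j:].
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n


def longest_match(ref, sa, query, j):
    """Return (ref_pos, length) of the longest prefix of query[j:] found in ref."""
    target = query[j:]
    lo, hi = 0, len(sa)
    # Binary search for where target would be inserted among the sorted suffixes.
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:] < target:
            lo = mid + 1
        else:
            hi = mid
    # The best match is always one of the two neighbours of that position.
    best = (-1, 0)
    for k in (lo - 1, lo):
        if 0 <= k < len(sa):
            n = common_prefix_len(ref, sa[k], query, j)
            if n > best[1]:
                best = (sa[k], n)
    return best


def find_mems(ref, query, min_len=3):
    """Return (ref_pos, query_pos, length) triples that are maximal exact matches."""
    sa = build_suffix_array(ref)
    mems = []
    for j in range(len(query)):
        i, n = longest_match(ref, sa, query, j)
        if n < min_len:
            continue
        # Right-maximal by construction; keep only if it cannot grow to the left.
        if i == 0 or j == 0 or ref[i - 1] != query[j - 1]:
            mems.append((i, j, n))
    return mems


print(find_mems("ACGTACGGA", "TTACGGAC"))  # [(3, 1, 6)]
```

In this example the match `TACGGA` starts at position 3 of the first sequence and position 1 of the second, and it cannot be extended in either direction. Matches like this one serve as the anchors at which sequences are cut.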

AVAILABILITY AND IMPLEMENTATION

Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.


