FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets.

Affiliations

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China.

Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China.

Publication information

Bioinformatics. 2024 Jan 2;40(1). doi: 10.1093/bioinformatics/btae014.

DOI: 10.1093/bioinformatics/btae014

PMID: 38200554

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10809904/

Abstract

MOTIVATION

In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly.
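The vertical division idea can be illustrated with a minimal Python sketch. It is not the FMAlign or FMAlign2 implementation: it assumes the positions of equal-length common seeds are already known for every sequence (no FM-index is built here), and `toy_align` is a hypothetical placeholder that only pads segments with gaps where a real tool would run an MSA method on each block. All function names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_at_anchors(seq, anchors, anchor_len):
    """Cut seq into the segments lying between consecutive anchors."""
    segments, prev = [], 0
    for start in anchors:
        segments.append(seq[prev:start])
        prev = start + anchor_len
    segments.append(seq[prev:])
    return segments


def toy_align(block):
    """Placeholder sub-MSA (assumption): right-pad segments with gaps."""
    width = max(len(s) for s in block)
    return [s.ljust(width, "-") for s in block]


def vertical_division_msa(seqs, anchors_per_seq, anchor_len):
    """Split every sequence at shared anchors, align blocks in parallel, rejoin."""
    per_seq = [split_at_anchors(s, a, anchor_len)
               for s, a in zip(seqs, anchors_per_seq)]
    # Block k holds the k-th segment of every sequence.
    blocks = list(zip(*per_seq))
    with ThreadPoolExecutor() as pool:
        aligned_blocks = list(pool.map(toy_align, blocks))
    rows = []
    for i, seq in enumerate(seqs):
        parts = []
        for k, block in enumerate(aligned_blocks):
            parts.append(block[i])
            if k < len(anchors_per_seq[i]):
                start = anchors_per_seq[i][k]
                # Anchors are identical in all sequences, so copy them verbatim.
                parts.append(seq[start:start + anchor_len])
        rows.append("".join(parts))
    return rows


# The seed "TTTG" occurs at position 3 in the first sequence and 2 in the second.
alignment = vertical_division_msa(["ACGTTTGACCA", "ACTTTGA"], [[3], [2]], 4)
for row in alignment:
    print(row)
```

Because the blocks between seeds are independent of one another, each can be handed to a separate worker, which is what makes the strategy attractive for very long sequences.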

RESULTS

FMAlign2 uses a suffix array to identify maximal exact matches, changing FMAlign's approach from searching for global chains to searching for partial chains. With a vertical division strategy, a large-scale problem is decomposed into manageable tasks, enabling parallel execution of sub-MSAs. Sequence-profile alignment and refinement are then used to concatenate the subsets into the final result. Compared to FMAlign, FMAlign2 markedly improves sequence segmentation and significantly reduces running time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enables existing MSA methods to handle sequences reaching billions in length within an acceptable time frame.
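To make the term "maximal exact match" concrete, here is a small Python sketch for two sequences. It is an illustration under stated assumptions, not FMAlign2's algorithm: the suffix array is built naively by sorting suffixes, the separator `#` is assumed to be absent from both inputs, and only matches witnessed by adjacent suffix-array entries are reported, so some maximal exact matches can be missed. The function names are hypothetical.

```python
def suffix_array(text):
    """Naive construction by sorting suffixes; suitable for toy inputs only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(text, i, j):
    """Length of the longest common prefix of text[i:] and text[j:]."""
    n = 0
    while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
        n += 1
    return n


def adjacent_mems(a, b, min_len):
    """Report (pos_in_a, pos_in_b, length) for maximal exact matches found
    between neighbouring suffixes that come from different sequences."""
    text = a + "#" + b          # '#' is assumed not to occur in a or b
    offset = len(a) + 1         # start of b inside the joined text
    sa = suffix_array(text)
    mems = []
    for p, q in zip(sa, sa[1:]):
        if p > q:
            p, q = q, p
        if not (p < len(a) and q >= offset):
            continue            # both suffixes from the same sequence
        # The common prefix cannot be extended to the right by definition.
        length = common_prefix_len(text, p, q)
        if length < min_len:
            continue
        # Keep the match only if it cannot be extended to the left either.
        if p == 0 or text[p - 1] != text[q - 1]:
            mems.append((p, q - offset, length))
    return mems


mems = adjacent_mems("GATTACAGG", "CCATTACAT", min_len=5)
print(mems)
```

In this example the shared substring `ATTACA` starts at position 1 of the first sequence and position 2 of the second. The shorter match `TTACA` is rejected because it can be extended one base to the left.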

AVAILABILITY AND IMPLEMENTATION

Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.
