FMAlign2: a novel fast multiple nucleotide sequence alignment method for ultralong datasets.

Affiliations

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, Sichuan, China.

Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China.

Publication information

Bioinformatics. 2024 Jan 2;40(1). doi: 10.1093/bioinformatics/btae014.

DOI:10.1093/bioinformatics/btae014
PMID:38200554
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10809904/
Abstract

MOTIVATION

In bioinformatics, multiple sequence alignment (MSA) is a crucial task. However, conventional methods often struggle with aligning ultralong sequences. To address this issue, researchers have designed MSA methods rooted in a vertical division strategy, which segments sequence data for parallel alignment. A prime example of this approach is FMAlign, which utilizes the FM-index to extract common seeds and segment the sequences accordingly.
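The vertical division strategy described above can be illustrated with a minimal sketch. It assumes the shared seeds (anchors) are already known and appear in the same order in every sequence; the function names, the anchor format, and the gap-padding stand-in for a real sub-alignment are all hypothetical and are not taken from FMAlign.

```python
from concurrent.futures import ThreadPoolExecutor

def split_at_anchors(seqs, anchors):
    """Cut every sequence at shared exact-match anchors.

    anchors: list of (starts, length), where starts[i] is the anchor's start
    in seqs[i]. Anchors are assumed sorted and non-overlapping in every
    sequence. Returns blocks in order: region, anchor, region, ...
    """
    blocks = []
    prev = [0] * len(seqs)
    for starts, length in anchors:
        # Region between the previous anchor and this one.
        blocks.append([s[p:st] for s, p, st in zip(seqs, prev, starts)])
        # The anchor itself is identical in all sequences.
        blocks.append([s[st:st + length] for s, st in zip(seqs, starts)])
        prev = [st + length for st in starts]
    # Tail after the last anchor.
    blocks.append([s[p:] for s, p in zip(seqs, prev)])
    return blocks

def pad_block(block):
    """Placeholder for a real sub-MSA: right-pad with gaps to equal width."""
    width = max(len(x) for x in block)
    return [x.ljust(width, "-") for x in block]

def vertical_division_align(seqs, anchors):
    blocks = split_at_anchors(seqs, anchors)
    # Blocks are independent, so they can be processed concurrently.
    with ThreadPoolExecutor() as pool:
        aligned = list(pool.map(pad_block, blocks))
    # Concatenate the per-block results row by row.
    return ["".join(b[i] for b in aligned) for i in range(len(seqs))]
```

For example, calling `vertical_division_align(["ACGTTTGCA", "AGTTTGGCA", "CGTTTCA"], [((2, 1, 1), 4)])` splits all three sequences around the shared seed `GTTT` and returns rows of equal length. In practice `pad_block` would be replaced by an actual MSA tool run on each block.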

RESULTS

FMAlign2 leverages the suffix array to identify maximal exact matches, redefining the approach of FMAlign from searching for global chains to partial chains. By using a vertical division strategy, a large-scale problem is decomposed into manageable tasks, enabling parallel execution of subMSA. Furthermore, sequence-profile alignment and refinement are incorporated to concatenate subsets, yielding the final result seamlessly. Compared to FMAlign, FMAlign2 markedly augments the segmentation of sequences and significantly reduces the time while maintaining accuracy, especially on ultralong datasets. Importantly, FMAlign2 enhances existing MSA methods by conferring the capability to handle sequences reaching billions in length within an acceptable time frame.
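To make the idea of finding maximal exact matches with a suffix array concrete, here is a toy sketch for two sequences. It is not FMAlign2's implementation: the suffix array is built naively, only the longest match starting at each query position is reported (so some maximal exact matches may be missed), and every function name and the `min_len` parameter are assumptions made for illustration.

```python
def build_suffix_array(s):
    # Naive O(n^2 log n) construction; fine for a sketch, not for genomes.
    return sorted(range(len(s)), key=lambda i: s[i:])

def common_prefix_len(a, i, b, j):
    # Length of the longest common prefix of a[i:] and b[j:].
    n = 0
    while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
        n += 1
    return n

def longest_match(ref, sa, query, j):
    """Longest prefix of query[j:] occurring in ref, as (ref_pos, length)."""
    q = query[j:]
    lo, hi = 0, len(sa)
    # Binary search for the insertion point of q among the sorted suffixes.
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:] < q:
            lo = mid + 1
        else:
            hi = mid
    # The best match is one of the two suffixes adjacent to that point.
    best = (0, 0)
    for k in (lo - 1, lo):
        if 0 <= k < len(sa):
            n = common_prefix_len(ref, sa[k], query, j)
            if n > best[1]:
                best = (sa[k], n)
    return best

def maximal_exact_matches(ref, query, min_len=3):
    """Return (ref_pos, query_pos, length) for left-maximal longest matches."""
    sa = build_suffix_array(ref)
    mems = []
    for j in range(len(query)):
        i, n = longest_match(ref, sa, query, j)
        # Right-maximal by construction; check that it cannot extend left.
        left_maximal = i == 0 or j == 0 or ref[i - 1] != query[j - 1]
        if n >= min_len and left_maximal:
            mems.append((i, j, n))
    return mems
```

With `ref = "ACGTACGGA"` and `query = "TTACGGC"`, the sketch reports the single match `(3, 1, 5)`, the shared substring `TACGG`. Matches like this, found across all input sequences, are what a vertical division strategy can use as cut points.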

AVAILABILITY AND IMPLEMENTATION

Source code and datasets are available at https://github.com/malabz/FMAlign2 and https://zenodo.org/records/10435770.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10d4/10809904/996614a0875c/btae014f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10d4/10809904/2ddc89c7c2a2/btae014f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/10d4/10809904/3ad29f22b17d/btae014f2.jpg
