HAlign 4: a new strategy for rapidly aligning millions of sequences.

Authors

Zhou Tong, Zhang Pinglu, Zou Quan, Han Wu

Affiliations

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China.

Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China.

Publication information

Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae718.

DOI: 10.1093/bioinformatics/btae718
PMID: 39607773
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11646084/
Abstract

MOTIVATION

HAlign is a high-performance multiple sequence alignment software based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and extremely large numbers of sequences.
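
As a rough illustration of the star alignment strategy mentioned above, the following Python sketch picks a center sequence, aligns every other sequence to it pairwise, and merges the pairwise alignments by padding gaps. It is a toy under stated assumptions, not HAlign's implementation: the function names `edit_alignment` and `center_star` are hypothetical, the center is chosen by minimum summed edit distance, and the pairwise step is plain quadratic dynamic programming with unit costs.

```python
def edit_alignment(a, b):
    """Global pairwise alignment with unit costs; returns (distance, a_aln, b_aln)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                           dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ra, rb, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            ra.append(a[i - 1]); rb.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ra.append(a[i - 1]); rb.append("-"); i -= 1
        else:
            ra.append("-"); rb.append(b[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(ra)), "".join(reversed(rb))


def center_star(seqs):
    """Toy center-star MSA: returns one gapped row per input sequence, in input order."""
    # Assumption: the center is the sequence with the smallest summed distance to all others.
    center = min(seqs, key=lambda c: sum(edit_alignment(c, s)[0] for s in seqs))
    pairs = [edit_alignment(center, s)[1:] for s in seqs]

    # gaps[j] = largest number of gap columns any pairwise alignment opens
    # before center position j (slot len(center) is the trailing slot).
    gaps = [0] * (len(center) + 1)
    for c_aln, _ in pairs:
        j = run = 0
        for ch in c_aln:
            if ch == "-":
                run += 1
            else:
                gaps[j] = max(gaps[j], run)
                j, run = j + 1, 0
        gaps[j] = max(gaps[j], run)

    # Re-emit each sequence, padding every slot up to the shared gap count.
    rows = []
    for c_aln, s_aln in pairs:
        out, j, run = [], 0, 0
        for x, y in zip(c_aln, s_aln):
            if x == "-":
                out.append(y)
                run += 1
            else:
                out.append("-" * (gaps[j] - run))
                out.append(y)
                j, run = j + 1, 0
        out.append("-" * (gaps[j] - run))
        rows.append("".join(out))
    return rows


if __name__ == "__main__":
    for row in center_star(["ACGT", "ACT", "AGGT", "ACGGT"]):
        print(row)
```

Because every sequence is aligned only against the center, the number of pairwise alignments grows linearly with the number of sequences, which is why the strategy suits very large sets of similar sequences.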

RESULTS

To address this issue, we have implemented HAlign4 in C++. In this version, we replaced the original suffix tree with the Burrows-Wheeler Transform and introduced the wavefront alignment algorithm to further optimize both time and memory efficiency. Experiments show that HAlign4 significantly outperforms HAlign3 in runtime and memory usage in both single-threaded and multi-threaded configurations, while maintaining high alignment accuracy comparable to MAFFT. HAlign4 can complete the alignment of 10 million coronavirus disease 2019 (COVID-19) sequences in about 12 min with 300 GB of memory using 96 threads, demonstrating its efficiency and practicality for large-scale alignment on standard workstations.
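
To make the Burrows-Wheeler Transform concrete, here is a minimal Python sketch that builds the transform of a short string and counts exact pattern occurrences with backward search, the kind of substring lookup a BWT-based index supports in place of a suffix tree. The function names `bwt` and `fm_count` are hypothetical, the suffix sorting is naive, and nothing here reflects how HAlign4 actually builds or queries its index.

```python
from collections import Counter


def bwt(text, sentinel="$"):
    """Burrows-Wheeler Transform via naive suffix sorting (fine for short strings only)."""
    assert sentinel not in text, "sentinel must not occur in the text"
    s = text + sentinel
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # Each output character is the one preceding the corresponding sorted suffix.
    return "".join(s[i - 1] for i in suffix_array)


def fm_count(bwt_str, pattern):
    """Count exact occurrences of pattern using backward search over the BWT."""
    counts = Counter(bwt_str)
    # first[c] = number of characters in the text that sort before c.
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    lo, hi = 0, len(bwt_str)  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        # A real index would use precomputed rank tables instead of str.count.
        lo = first[ch] + bwt_str[:lo].count(ch)
        hi = first[ch] + bwt_str[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                   # annb$aa
    print(fm_count(transformed, "ana"))  # 2
```

The transform stores one character per text position, whereas a suffix tree stores explicit nodes and pointers, which is the usual reason a BWT-based index needs less memory.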

AVAILABILITY AND IMPLEMENTATION

Source code is available at https://github.com/malabz/HAlign-4; the dataset is available at https://zenodo.org/records/13934503.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4720/11646084/bccafea37f3e/btae718f3.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4720/11646084/17a68e25f5a6/btae718f1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4720/11646084/438ef26ed608/btae718f2.jpg
