

HAlign 4: a new strategy for rapidly aligning millions of sequences.

Authors

Zhou Tong, Zhang Pinglu, Zou Quan, Han Wu

Affiliations

Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan 610054, China.

Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang 324003, China.

Publication information

Bioinformatics. 2024 Nov 28;40(12). doi: 10.1093/bioinformatics/btae718.

Abstract

MOTIVATION

HAlign is a high-performance multiple sequence alignment tool based on the star alignment strategy, which is the preferred choice for rapidly aligning large numbers of sequences. HAlign3, implemented in Java, is the latest version capable of aligning an ultra-large number of similar DNA/RNA sequences. However, HAlign3 still struggles with long sequences and extremely large numbers of sequences.

RESULTS

To address this issue, we implemented HAlign4 in C++. In this version, we replaced the original suffix tree with the Burrows-Wheeler Transform and introduced the wavefront alignment algorithm to further optimize both time and memory efficiency. Experiments show that HAlign4 significantly outperforms HAlign3 in runtime and memory usage in both single-threaded and multi-threaded configurations, while maintaining high alignment accuracy comparable to MAFFT. HAlign4 can align 10 million coronavirus disease 2019 (COVID-19) sequences in about 12 min using 96 threads and 300 GB of memory, demonstrating its efficiency and practicality for large-scale alignment on standard workstations.
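The two building blocks named above can be illustrated with a small, self-contained C++ sketch. It is not taken from the HAlign4 source: the function names (`bwt_of`, `count_occurrences`, `wavefront_edit_distance`) are hypothetical, the Burrows-Wheeler Transform is built by naive suffix sorting, and the wavefront part handles only unit-cost edit distance, whereas the wavefront alignment algorithm in the literature targets gap-affine penalties. The aim is only to show why a BWT can stand in for a suffix tree when searching for shared substrings, and how a wavefront advances along diagonals instead of filling a full dynamic-programming matrix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// --- Part 1: Burrows-Wheeler Transform with backward search ---

// Naive BWT: sort all suffixes of text + '$' (fine for a demo, too slow for genomes).
std::string bwt_of(const std::string& text) {
    assert(text.find('$') == std::string::npos);  // '$' is reserved as the sentinel
    const std::string s = text + '$';
    std::vector<std::size_t> sa(s.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&s](std::size_t x, std::size_t y) {
        return s.compare(x, std::string::npos, s, y, std::string::npos) < 0;
    });
    std::string bwt(s.size(), '$');
    for (std::size_t r = 0; r < sa.size(); ++r)
        bwt[r] = s[(sa[r] + s.size() - 1) % s.size()];  // character preceding each suffix
    return bwt;
}

// Count exact occurrences of `pattern` using only the BWT (FM-index style backward search).
std::size_t count_occurrences(const std::string& bwt, const std::string& pattern) {
    std::vector<std::size_t> smaller(257, 0);  // smaller[c] = number of bwt characters < c
    for (unsigned char c : bwt) ++smaller[c + 1u];
    for (std::size_t c = 1; c < smaller.size(); ++c) smaller[c] += smaller[c - 1];
    auto occ = [&bwt](char c, std::size_t end) {  // occurrences of c in bwt[0, end); linear scan
        const auto last = bwt.begin() + static_cast<std::ptrdiff_t>(end);
        return static_cast<std::size_t>(std::count(bwt.begin(), last, c));
    };
    std::size_t lo = 0, hi = bwt.size();  // current suffix-array interval [lo, hi)
    for (auto it = pattern.rbegin(); it != pattern.rend(); ++it) {
        const std::size_t base = smaller[static_cast<unsigned char>(*it)];
        lo = base + occ(*it, lo);
        hi = base + occ(*it, hi);
        if (lo >= hi) return 0;  // pattern does not occur
    }
    return hi - lo;
}

// --- Part 2: wavefront-style edit distance (unit costs only) ---

// For each score s, keep per diagonal k = j - i the furthest row i reachable with s edits.
int wavefront_edit_distance(const std::string& a, const std::string& b) {
    const int n = static_cast<int>(a.size()), m = static_cast<int>(b.size());
    const int none = -(n + m + 2);  // marks "diagonal not reached yet"
    const int shift = n + m + 1;    // maps diagonal k to vector index k + shift
    const int target = m - n;       // diagonal containing the end cell (n, m)
    auto extend = [&](int i, int k) {  // slide along matching characters at no cost
        while (i < n && i + k < m && a[i] == b[i + k]) ++i;
        return i;
    };
    std::vector<int> prev(2 * shift + 1, none), next;
    prev[shift] = extend(0, 0);
    for (int s = 0;; ++s) {  // terminates: the distance is at most max(n, m)
        if (-s <= target && target <= s && prev[target + shift] >= n) return s;
        next.assign(prev.size(), none);
        const int t = s + 1;  // score of the wavefront being computed
        for (int k = std::max(-t, -n); k <= std::min(t, m); ++k) {
            int i = std::max({prev[k + shift] + 1,        // substitution
                              prev[k - 1 + shift],        // insertion (advance in b only)
                              prev[k + 1 + shift] + 1});  // deletion (advance in a only)
            if (i < 0) continue;          // no predecessor reached this diagonal
            i = std::min({i, n, m - k});  // stay inside the alignment matrix
            next[k + shift] = extend(i, k);
        }
        prev.swap(next);
    }
}
```

For example, `count_occurrences(bwt_of("ACGTACGT"), "ACG")` evaluates to 2 and `wavefront_edit_distance("ACGT", "AGT")` evaluates to 1. In the second function the work grows with the number of differences between the two sequences rather than with the product of their lengths, which is the property that makes wavefront methods attractive for highly similar sequences such as viral genomes.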

AVAILABILITY AND IMPLEMENTATION

Source code is available at https://github.com/malabz/HAlign-4; the dataset is available at https://zenodo.org/records/13934503.


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4720/11646084/17a68e25f5a6/btae718f1.jpg
