MAGUS: Multiple sequence Alignment using Graph clUStering.

Affiliation

Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

Publication information

Bioinformatics. 2021 Jul 19;37(12):1666-1672. doi: 10.1093/bioinformatics/btaa992.

DOI:10.1093/bioinformatics/btaa992

PMID:33252662

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8289385/

Abstract

MOTIVATION

The estimation of large multiple sequence alignments (MSAs) is a basic bioinformatics challenge. Divide-and-conquer is a useful approach that has been shown to improve the scalability and accuracy of MSA estimation in established methods such as SATé and PASTA. In these divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g. MAFFT), and then merged together into an alignment on the full dataset.
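The divide, align, and merge steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by SATé, PASTA, or MAGUS: `divide`, `toy_align`, `block_merge`, and `induced` are hypothetical helpers, the decomposition is a plain chunking rather than a tree-based one, the base aligner is a gap-padding placeholder standing in for a tool such as MAFFT, and the merger simply places each subset alignment in its own block of columns. The sketch shows the constraint any merger must satisfy: restricting the merged alignment to one subset recovers that subset's alignment.

```python
from typing import Callable

# An alignment maps sequence name -> gapped row; all rows have equal length.
Alignment = dict[str, str]


def divide(names: list[str], max_size: int) -> list[list[str]]:
    """Split names into disjoint chunks (assumption: real tools cut a guide tree)."""
    return [names[i:i + max_size] for i in range(0, len(names), max_size)]


def toy_align(seqs: dict[str, str]) -> Alignment:
    """Placeholder base aligner: right-pad with gaps. Not a real alignment method."""
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def block_merge(parts: list[Alignment]) -> Alignment:
    """Trivial merger: each subset alignment occupies its own block of columns."""
    widths = [len(next(iter(p.values()))) for p in parts]
    total = sum(widths)
    merged: Alignment = {}
    offset = 0
    for part, width in zip(parts, widths):
        for name, row in part.items():
            merged[name] = "-" * offset + row + "-" * (total - offset - width)
        offset += width
    return merged


def induced(merged: Alignment, names: list[str]) -> Alignment:
    """Restrict an alignment to `names` and drop columns that are all gaps."""
    rows = [merged[n] for n in names]
    keep = [j for j in range(len(rows[0])) if any(r[j] != "-" for r in rows)]
    return {n: "".join(merged[n][j] for j in keep) for n in names}


def divide_and_conquer(
    seqs: dict[str, str],
    max_size: int,
    align: Callable[[dict[str, str]], Alignment] = toy_align,
    merge: Callable[[list[Alignment]], Alignment] = block_merge,
) -> tuple[Alignment, list[Alignment]]:
    """Divide into disjoint subsets, align each subset, then merge."""
    subsets = divide(sorted(seqs), max_size)
    parts = [align({n: seqs[n] for n in subset}) for subset in subsets]
    return merge(parts), parts
```

A block-diagonal merge aligns nothing across subsets, so it is only a baseline; swapping in a different `merge` function is the point at which merging methods differ from one another.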

RESULTS

We present MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS is similar to PASTA in that it uses nearly the same initial steps (starting tree, similar decomposition strategy, and MAFFT to compute subset alignments), but then merges the subset alignments using the Graph Clustering Merger, a new method for combining disjoint alignments that we present in this study. Our study, on a heterogeneous collection of biological and simulated datasets, shows that MAGUS produces improved accuracy and is faster than PASTA on large datasets, and matches it on smaller datasets.

AVAILABILITY AND IMPLEMENTATION

MAGUS: https://github.com/vlasmirnov/MAGUS.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
