Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.
Bioinformatics. 2021 Jul 19;37(12):1666-1672. doi: 10.1093/bioinformatics/btaa992.
The estimation of large multiple sequence alignments (MSAs) is a basic bioinformatics challenge. Divide-and-conquer is a useful approach that has been shown to improve the scalability and accuracy of MSA estimation in established methods such as SATé and PASTA. In these divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g. MAFFT), and then merged together into an alignment on the full dataset.
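The divide, align, and merge steps described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the procedure used by SATé, PASTA, or MAGUS: `split_disjoint`, `toy_align`, and `block_merge` are hypothetical names, the decomposition is a simple chunking rather than a tree-based one, the "aligner" merely pads with gaps where a real pipeline would call a base method such as MAFFT, and the merge places each subset alignment in its own block of columns without aligning anything across subsets.

```python
from typing import Dict, List


def split_disjoint(names: List[str], max_size: int) -> List[List[str]]:
    """Chunk sequence names into disjoint subsets (stand-in for a tree-based decomposition)."""
    return [names[i:i + max_size] for i in range(0, len(names), max_size)]


def toy_align(seqs: Dict[str, str]) -> Dict[str, str]:
    """Placeholder for a base MSA method: right-pad every sequence with gaps to equal length."""
    width = max(len(s) for s in seqs.values())
    return {name: s.ljust(width, "-") for name, s in seqs.items()}


def block_merge(subalignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Trivial merge: each subset alignment occupies its own block of columns."""
    widths = [len(next(iter(aln.values()))) for aln in subalignments]
    total = sum(widths)
    merged: Dict[str, str] = {}
    offset = 0
    for aln, width in zip(subalignments, widths):
        for name, row in aln.items():
            # Gaps before and after the block keep every row at the full width.
            merged[name] = "-" * offset + row + "-" * (total - offset - width)
        offset += width
    return merged


def divide_and_conquer(seqs: Dict[str, str], max_size: int) -> Dict[str, str]:
    """Divide into disjoint subsets, align each subset, then merge the subset alignments."""
    subsets = split_disjoint(sorted(seqs), max_size)
    subalignments = [toy_align({n: seqs[n] for n in subset}) for subset in subsets]
    return block_merge(subalignments)
```

The block merge is valid in the sense that every row has the same length and each subset alignment is kept intact, but it finds no homology between subsets; a real merger has to decide which columns of different subset alignments belong together.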
We present MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS is similar to PASTA in that it uses nearly the same initial steps (starting tree, similar decomposition strategy, and MAFFT to compute subset alignments), but it then merges the subset alignments using the Graph Clustering Merger, a new method for combining disjoint alignments that we present in this study. Our study, on a heterogeneous collection of biological and simulated datasets, shows that MAGUS is more accurate and faster than PASTA on large datasets, and matches it on smaller datasets.
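One property that can be checked for any method that combines disjoint alignments is whether the merged alignment still contains each subset alignment unchanged. The sketch below assumes that this is the expected behavior of a merger, which the abstract does not state explicitly; it is not part of MAGUS, and `induced_alignment` and `preserves_subsets` are hypothetical helper names. Alignments are represented as dictionaries mapping sequence names to gapped rows of equal length.

```python
from typing import Dict, Iterable, List


def induced_alignment(alignment: Dict[str, str], names: Iterable[str]) -> Dict[str, str]:
    """Restrict an alignment to the given rows and drop columns that become all gaps."""
    rows = {name: alignment[name] for name in names}
    width = len(next(iter(rows.values())))
    keep = [j for j in range(width) if any(row[j] != "-" for row in rows.values())]
    return {name: "".join(row[j] for j in keep) for name, row in rows.items()}


def preserves_subsets(merged: Dict[str, str], subalignments: List[Dict[str, str]]) -> bool:
    """True if the merged alignment induces every subset alignment exactly."""
    return all(
        # Normalize the subset alignment too, in case it carries all-gap columns.
        induced_alignment(merged, sub.keys()) == induced_alignment(sub, sub.keys())
        for sub in subalignments
    )
```

For example, a merged alignment with rows `ACGT-`, `AC---`, and `-CGTA` preserves the subset alignments `{ACGT, AC--}` and `{CGTA}`, whereas shifting the second row to `--AC-` would not.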
MAGUS: https://github.com/vlasmirnov/MAGUS.
Supplementary data are available at Bioinformatics online.