Warnow Tandy, Mirarab Siavash
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA.
Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA, USA.
Methods Mol Biol. 2021;2231:99-119. doi: 10.1007/978-1-0716-1036-7_7.
Estimating very large multiple sequence alignments is a challenging problem that requires special techniques to achieve high accuracy. Here we describe two software packages, PASTA and UPP, for constructing alignments on large and ultra-large datasets. Both methods have produced highly accurate alignments on datasets of 1,000,000 sequences, and trees computed on these alignments are also highly accurate. PASTA provides the best tree accuracy when all input sequences are full-length, whereas UPP is more accurate than PASTA and other methods when the input contains a large number of fragmentary sequences. Both methods are available as open-source software on GitHub.