Shen Chengze, Park Minhyuk, Warnow Tandy
Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, USA.
J Comput Biol. 2022 Aug;29(8):782-801. doi: 10.1089/cmb.2021.0585. Epub 2022 May 17.
Accurate multiple sequence alignment is challenging on many data sets, including those that are large, evolve at high rates, or have sequence length heterogeneity. While substantial progress has been made over the last decade in addressing the first two challenges, sequence length heterogeneity remains a significant issue for many data sets. Sequence length heterogeneity occurs for biological and technological reasons, including large insertions or deletions (indels) that occurred in the evolutionary history relating the sequences, or the inclusion of sequences that are not fully assembled. Ultra-large alignments using Phylogeny-Aware Profiles (UPP) (Nguyen et al. 2015) is one of the most accurate approaches for aligning data sets that exhibit sequence length heterogeneity: it constructs an alignment on the subset of sequences it considers "full-length," represents this "backbone alignment" using an ensemble of hidden Markov models (HMMs), and then adds each remaining sequence into the backbone alignment based on an HMM selected for that sequence from the ensemble. Our new method, WeIghTed Consensus Hmm alignment (WITCH), improves on UPP in three important ways: first, it uses a statistically principled technique to weight and rank the HMMs; second, it uses multiple HMMs from the ensemble rather than a single HMM; and third, it combines the alignments for each of the selected HMMs using a consensus algorithm that takes the weights into account. We show that this approach provides improved alignment accuracy compared with UPP and other leading alignment methods, as well as improved accuracy for maximum likelihood trees based on these alignments.
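The three steps named in the abstract (weight the HMMs, keep the top-ranked ones, merge their alignments by weighted consensus) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, since the abstract gives no formulas: the weight is taken to be proportional to the HMM's subset size times 2 raised to its bit score for the query, and the consensus is a small dynamic program that maximizes the total vote weight while keeping backbone columns strictly increasing. The names `hmm_weights`, `top_k`, and `weighted_consensus` are hypothetical and are not taken from the WITCH software.

```python
import math
from collections import defaultdict


def hmm_weights(bit_scores, subset_sizes):
    """Assumed weighting: w_i proportional to subset_size_i * 2**bit_score_i."""
    logs = [b + math.log2(s) for b, s in zip(bit_scores, subset_sizes)]
    top = max(logs)  # subtract the maximum before exponentiating, for stability
    raw = [2.0 ** (x - top) for x in logs]
    total = sum(raw)
    return [r / total for r in raw]


def top_k(weights, k):
    """Indices of the k highest-weighted HMMs, best first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


def weighted_consensus(placements, weights):
    """Merge per-HMM placements of one query sequence.

    placements[h][p] is the backbone column that HMM h assigns to query
    position p, or None when it treats that position as an insertion.
    Returns one column (or None) per query position, chosen to maximize
    the summed weight of agreeing HMMs with columns strictly increasing.
    """
    n = len(placements[0])
    best = {-1: (0.0, [])}  # last backbone column used -> (score, path)
    for p in range(n):
        votes = defaultdict(float)
        votes[None] += 0.0  # an insertion is always a legal fallback
        for cols, w in zip(placements, weights):
            votes[cols[p]] += w
        nxt = {}
        for last, (score, path) in best.items():
            for col, w in votes.items():
                if col is not None and col <= last:
                    continue  # would break left-to-right column order
                new_last = last if col is None else col
                cand = (score + w, path + [col])
                if new_last not in nxt or cand[0] > nxt[new_last][0]:
                    nxt[new_last] = cand
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]


# Usage: three hypothetical HMMs place a two-residue query.
weights = hmm_weights([12.0, 11.5, 11.0], [10, 12, 12])
chosen = top_k(weights, 2)
merged = weighted_consensus(
    [[3, 4], [0, 3], [1, 3]], [0.4, 0.35, 0.25]
)
```

In the usage example, a plain per-position majority vote would assign both residues to column 3, which is not a valid alignment. The ordering constraint in the dynamic program rules that out, which is why a consensus step needs more than independent votes.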