Liu Baqiao, Warnow Tandy
Department of Computer Science, University of Illinois Urbana-Champaign, Champaign, IL 61820, USA.
Bioinform Adv. 2023 Mar 6;3(1):vbad024. doi: 10.1093/bioadv/vbad024. eCollection 2023.
Multiple sequence alignment is a basic part of many bioinformatics pipelines, including in phylogeny estimation, prediction of structure for both RNAs and proteins, and metagenomic sequence analysis. Yet many sequence datasets exhibit substantial sequence length heterogeneity, both because of large insertions and deletions in the evolutionary history of the sequences and the inclusion of unassembled reads or incompletely assembled sequences in the input. A few methods have been developed that can be highly accurate in aligning datasets with sequence length heterogeneity, with UPP one of the first methods to achieve good accuracy, and WITCH a recent improvement on UPP for accuracy. In this article, we show how we can speed up WITCH. Our improvement includes replacing a critical step in WITCH (currently performed using a heuristic search) by a polynomial time exact algorithm using Smith-Waterman. Our new method, WITCH-NG (i.e. 'next generation WITCH') achieves the same accuracy but is substantially faster. WITCH-NG is available at https://github.com/RuneBlaze/WITCH-NG.
The datasets used in this study are from prior publications and are freely available in public repositories, as indicated in the Supplementary Materials.
Supplementary data are available at online.
多序列比对是许多生物信息学流程的基本组成部分,包括系统发育估计、RNA和蛋白质结构预测以及宏基因组序列分析。然而,许多序列数据集存在显著的序列长度异质性,这既是由于序列进化历史中的大量插入和缺失,也是由于输入中包含未组装的读段或组装不完整的序列。已经开发了一些方法,这些方法在比对具有序列长度异质性的数据集时可以达到很高的准确性,UPP是最早获得良好准确性的方法之一,而WITCH是对UPP准确性的最新改进。在本文中,我们展示了如何加速WITCH。我们的改进包括用使用Smith-Waterman的多项式时间精确算法取代WITCH中的一个关键步骤(目前使用启发式搜索执行)。我们的新方法WITCH-NG(即“下一代WITCH”)实现了相同的准确性,但速度要快得多。WITCH-NG可在https://github.com/RuneBlaze/WITCH-NG上获取。
本研究中使用的数据集来自先前的出版物,可在公共存储库中免费获取,如补充材料中所示。
补充数据可在网上获取。