Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i328-i336. doi: 10.1093/bioinformatics/btae225.
Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve similar alignment accuracy as UPP with a significant reduction in computation time, particularly for datasets with long sequences.
The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.
多序列比对是计算生物学中的一个重要问题,其应用包括系统发育和蛋白质序列之间远程同源性的检测。UPP 是一个流行的软件包,它基于隐马尔可夫模型(HMM)的集合为大型数据集构建准确的多序列比对。该方法的计算瓶颈是序列到 HMM 的分配步骤,该步骤依赖于 HMM 上概率得分的精确计算。在这项工作中,我们表明,我们可以通过用可以有效估计的替代分数来替换这些 HMM 概率分数,从而大大加快这个分配步骤。我们提出的方法利用多臂老虎机算法自适应且高效地计算这些分数的估计值。这使我们能够以显著减少计算时间的方式实现与 UPP 相似的对齐精度,特别是对于具有长序列的数据集。
https://github.com/ilanshom/adaptiveMSA 上提供了用于生成本文结果的代码。