UPP2: fast and accurate alignment of datasets with fragmentary sequences.

Affiliation

Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA.

Publication information

Bioinformatics. 2023 Jan 1;39(1). doi: 10.1093/bioinformatics/btad007.

DOI:10.1093/bioinformatics/btad007

PMID:36625535

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC9846425/

Abstract

MOTIVATION

Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware Profiles (UPP) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full-length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets.
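
The two-stage idea described above (model the full-length sequences with an ensemble, then place each remaining sequence using a selected member of that ensemble) can be illustrated with a toy sketch. This is not the UPP implementation. The length threshold, the fixed grouping of backbone sequences, the k-mer overlap score standing in for an HMM score, and all names (`split_by_length`, `score`, `select_model`, `hmm_1`, `hmm_2`) are assumptions made purely for illustration.

```python
from statistics import median


def split_by_length(seqs, tolerance=0.25):
    """Partition sequences into full-length (backbone) and remaining (query) sets.

    A sequence is treated as full-length if its length is within `tolerance`
    of the median length. The threshold is an illustrative assumption.
    """
    med = median(len(s) for s in seqs.values())
    backbone = {n: s for n, s in seqs.items() if abs(len(s) - med) <= tolerance * med}
    queries = {n: s for n, s in seqs.items() if n not in backbone}
    return backbone, queries


def kmers(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def score(query, members, k=3):
    """Toy stand-in for an HMM score: fraction of query k-mers seen in the subset."""
    q = kmers(query, k)
    if not q:
        return 0.0
    seen = set().union(*(kmers(m, k) for m in members))
    return len(q & seen) / len(q)


def select_model(query, ensemble):
    """Pick the ensemble member (subset name) that scores the query highest."""
    return max(ensemble, key=lambda name: score(query, ensemble[name]))


if __name__ == "__main__":
    seqs = {
        "a": "ACGTACGTAC", "b": "ACGTACGTAA",
        "c": "TTGCATTGCA", "d": "TTGCATTGCC",
        "frag1": "GTACG", "frag2": "GCATT",
    }
    backbone, queries = split_by_length(seqs)
    # Hypothetical ensemble: each "model" is just a subset of backbone sequences.
    ensemble = {
        "hmm_1": [backbone["a"], backbone["b"]],
        "hmm_2": [backbone["c"], backbone["d"]],
    }
    for name, seq in queries.items():
        print(name, "->", select_model(seq, ensemble))
```

In a real pipeline each ensemble member would be a profile HMM built on a subset of the backbone alignment, and the chosen model would then be used to align the query. The sketch only shows the selection step, which is the step the abstract identifies as the computational bottleneck.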

RESULTS

We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime. We show that UPP2 produces more accurate alignments than leading MSA methods on datasets exhibiting substantial sequence length heterogeneity and is among the most accurate otherwise.

AVAILABILITY AND IMPLEMENTATION

https://github.com/gillichu/sepp.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
