Kruspe Matthias, Stadler Peter F
Bioinformatics Group, Department of Computer Science and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig, Germany.
BMC Bioinformatics. 2007 Jul 15;8:254. doi: 10.1186/1471-2105-8-254.
The quality of progressive sequence alignments strongly depends on the accuracy of the individual pairwise alignment steps since gaps that are introduced at one step cannot be removed at later aggregation steps. Adjacent insertions and deletions necessarily appear in arbitrary order in pairwise alignments and hence form an unavoidable source of errors.
Here we present a modified variant of progressive sequence alignment that addresses both issues. Instead of pairwise alignments we use exact dynamic programming to align sequence or profile triples. This avoids a large fraction of the ambiguities arising in pairwise alignments. In the subsequent aggregation steps we follow the logic of the Neighbor-Net algorithm, which constructs a phylogenetic network by replacing triples with pairs step by step instead of combining pairs into singletons. To this end the three-way alignments are subdivided into two partial alignments, at which stage all-gap columns are naturally removed. This alleviates the "once a gap, always a gap" problem of progressive alignment procedures.
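The two ideas above, exact three-way dynamic programming and the removal of all-gap columns when an alignment is reduced to fewer rows, can be illustrated with a minimal sketch. It is not the aln3nn implementation: the sum-of-pairs scoring with a linear gap cost, the score values, and the names `align3` and `project` are assumptions chosen for illustration. In particular, `project` only restricts an alignment to a subset of its rows, whereas the Neighbor-Net style subdivision described in the abstract is more involved.

```python
from itertools import product

GAP = "-"

def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols in a column (assumed linear gap cost)."""
    if x == GAP and y == GAP:
        return 0
    if x == GAP or y == GAP:
        return gap
    return match if x == y else mismatch

def column_score(col):
    """Sum-of-pairs score of a three-row alignment column."""
    a, b, c = col
    return pair_score(a, b) + pair_score(a, c) + pair_score(b, c)

def align3(s, t, u):
    """Exact global alignment of three sequences, O(|s||t||u|) time and space."""
    seqs = (s, t, u)
    n = tuple(len(x) for x in seqs)
    # Each move consumes a symbol from a non-empty subset of the sequences.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    best = {(0, 0, 0): (0, None)}  # cell -> (score, move used to reach it)
    # Lexicographic order guarantees every predecessor is filled first.
    for cell in product(*(range(k + 1) for k in n)):
        if cell == (0, 0, 0):
            continue
        candidates = []
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) < 0:
                continue
            col = tuple(seqs[k][prev[k]] if m[k] else GAP for k in range(3))
            candidates.append((best[prev][0] + column_score(col), m))
        best[cell] = max(candidates)
    # Trace back from the final cell to recover the alignment columns.
    cols, cell = [], n
    while cell != (0, 0, 0):
        m = best[cell][1]
        prev = tuple(c - d for c, d in zip(cell, m))
        cols.append(tuple(seqs[k][prev[k]] if m[k] else GAP for k in range(3)))
        cell = prev
    cols.reverse()
    rows = ["".join(col[k] for col in cols) for k in range(3)]
    return best[n][0], rows

def project(rows, keep):
    """Restrict an alignment to the rows in `keep` and drop all-gap columns."""
    sub = [rows[k] for k in keep]
    cols = [c for c in zip(*sub) if any(x != GAP for x in c)]
    return ["".join(col[k] for col in cols) for k in range(len(sub))]
```

Calling `align3("ACGT", "AGT", "ACT")` returns a score and three equal-length rows; `project(rows, (0, 1))` then yields the induced pairwise alignment with any column that is gapped in both remaining rows removed, so gaps forced only by the third sequence do not persist.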
The three-way Neighbor-Net based alignment program aln3nn is shown to compare favorably with other progressive alignment tools on both protein and nucleic acid sequences. In the latter case one can easily include scoring terms that consider secondary structure features. Overall, the quality of the resulting alignments generally exceeds that of clustalw or other multiple alignment tools, even though our software does not include heuristics for context-dependent (mis)match scores.