Department of Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.
PLoS Comput Biol. 2018 Nov 1;14(11):e1006547. doi: 10.1371/journal.pcbi.1006547. eCollection 2018 Nov.
Protein and DNA motifs are sequence regions of biological importance, and they are often highly conserved among homologous sequences. Generating multiple sequence alignments (MSAs) in which conserved sequence motifs are correctly aligned remains difficult, because the contribution of these typically short fragments is overshadowed by the rest of the sequence. To address this shortcoming, we extended the PRALINE multiple sequence alignment program with a novel motif-aware MSA algorithm. The method incorporates explicit information about externally provided sequence motifs and uses it in the dynamic programming step by boosting the amino acid substitution matrix towards the motif. The strength of the boost is controlled by a parameter, α. Using a benchmark set of alignments, we confirm that a good compromise can be found that improves the matching of motif regions without significantly reducing overall alignment quality. By estimating α on an unrelated set of reference alignments, we find that there is indeed a strong conservation signal for motifs. We explore a number of typical but difficult MSA use cases to illustrate the problems in correctly aligning functional sequence motifs and to show how the motif-aware alignment method can alleviate them.
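The idea of boosting substitution scores inside the dynamic programming step can be illustrated with a minimal pairwise sketch. It is not the PRALINE implementation: the abstract does not give the scoring formula, so the additive bonus `alpha` for columns where both residues are motif-annotated, the function name `motif_aware_score`, the boolean motif masks, and the toy match/mismatch/gap values are all assumptions made for illustration.

```python
from typing import Sequence


def motif_aware_score(
    seq_a: str,
    seq_b: str,
    motif_a: Sequence[bool],
    motif_b: Sequence[bool],
    alpha: float,
    match: float = 1.0,
    mismatch: float = -1.0,
    gap: float = -2.0,
) -> float:
    """Global (Needleman-Wunsch) alignment score with a motif bonus.

    Assumption: the boost is modelled as an additive bonus `alpha` whenever
    both aligned positions are flagged as motif positions. The toy
    match/mismatch scores stand in for a real substitution matrix.
    """
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            base = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # Hypothetical boost: reward pairing two motif-annotated positions.
            bonus = alpha if (motif_a[i - 1] and motif_b[j - 1]) else 0.0
            dp[i][j] = max(
                dp[i - 1][j - 1] + base + bonus,  # align the two residues
                dp[i - 1][j] + gap,               # gap in seq_b
                dp[i][j - 1] + gap,               # gap in seq_a
            )
    return dp[n][m]


# Toy example: the motif residue "N" sits at different offsets in each sequence.
no_boost = motif_aware_score("GN", "NG", [False, True], [True, False], alpha=0.0)
boosted = motif_aware_score("GN", "NG", [False, True], [True, False], alpha=3.0)
print(no_boost, boosted)
```

With `alpha = 0` the best alignment in this toy case is the ungapped one, which leaves the two motif residues unpaired. With `alpha = 3` the bonus outweighs the two gap penalties needed to pair them, so the optimum shifts to the motif-matching alignment. Too large a value would let motif pairing dominate everything else, which mirrors the compromise between motif matching and overall alignment quality described above.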