Institute of Biotechnology, University of Helsinki, P.O. Box 56, Viikinkaari 5, Helsinki, Finland.
BMC Bioinformatics. 2012 Apr 29;13:64. doi: 10.1186/1471-2105-13-64.
Multiple protein sequence alignment (MPSA) is the starting point for a multitude of applications in molecular biology. Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strictly modular structure, which allows different components of the alignment process to be swapped and, thus, their contribution to alignment quality and computation time to be investigated. We systematically varied information sources, guiding trees, score transformations and iterative refinement options, and evaluated the resulting alignments on BAliBASE and SABmark.
Our results identify the best alignment strategy among the choices compared. First, we show that pairwise global and local alignments contain sufficient information to construct a high-quality multiple alignment. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, triplet library extension, with introduction of new edges, is the most efficient consistency transformation of those compared. Alternatively, one can apply tree-dependent partitioning as a post-processing step, which was shown to be comparable to the best consistency transformation in both time and accuracy. Finally, propagating information beyond four transitive links introduces more noise than signal.
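To make the triplet library extension concrete, the following is a minimal Python sketch of the general idea as popularised by T-Coffee: the weight of an edge between residues x and y is increased by min(W(x,z), W(z,y)) for every intermediate residue z linked to both, and, optionally, edges that were absent from the original library are created. It is an illustration under stated assumptions, not the SeqAn-based implementation described here. The function names, the dictionary representation of the library and the (sequence, position) residue encoding are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edge_key(x, y):
    """Order-independent key for an edge between two residues."""
    return (x, y) if x <= y else (y, x)


def triplet_extend(library, add_new_edges=True):
    """Triplet extension of a pairwise alignment library (illustrative sketch).

    library maps a pair of residues to a weight. Each residue is a
    (sequence_id, position) tuple, and every edge is assumed to connect
    residues from two different sequences (hypothetical representation).
    """
    # Normalise the keys and build an adjacency view of the library.
    weights = {edge_key(x, y): w for (x, y), w in library.items()}
    neighbours = defaultdict(dict)
    for (x, y), w in weights.items():
        neighbours[x][y] = w
        neighbours[y][x] = w

    extended = dict(weights)
    # Every residue z acts as an intermediate for each pair of its neighbours.
    for z, adj in neighbours.items():
        for x, y in combinations(sorted(adj), 2):
            if x[0] == y[0]:
                continue  # both residues lie in the same sequence
            key = edge_key(x, y)
            if key not in weights and not add_new_edges:
                continue  # only re-weight edges already in the library
            # The path x - z - y supports the pair with its weaker link.
            extended[key] = extended.get(key, 0) + min(adj[x], adj[y])
    return extended


# Toy library: A1-B1 and B1-C1 are supported directly, A1-C1 is not.
toy_library = {
    (("A", 1), ("B", 1)): 10,
    (("B", 1), ("C", 1)): 6,
}
with_new_edges = triplet_extend(toy_library, add_new_edges=True)
without_new_edges = triplet_extend(toy_library, add_new_edges=False)
```

In this toy library the pair A1-C1 has no direct support; with `add_new_edges=True` it receives weight 6 through B1, whereas with `add_new_edges=False` it stays absent. This sketch uses only one intermediate residue per path, so it does not model the longer transitive chains mentioned above.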
This is the first time that multiple protein alignment strategies have been comprehensively and clearly compared using a single implementation platform. In particular, we showed which of the existing consistency transformations and iterative refinement techniques are the most effective. Our implementation is freely available at http://ekhidna.biocenter.helsinki.fi/MMSA and as a supplementary file attached to this article (see Additional file 1).