Bawono Punto, van der Velde Arjan, Abeln Sanne, Heringa Jaap
Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands.
Centre for Integrative Bioinformatics (IBIVU), VU University Amsterdam, Amsterdam, The Netherlands; Amsterdam Institute for Molecules Medicines and Systems (AIMMS), VU University Amsterdam, Amsterdam, The Netherlands.
PLoS One. 2015 May 19;10(5):e0127431. doi: 10.1371/journal.pone.0127431. eCollection 2015.
Multiple Sequence Alignment (MSA) methods are typically benchmarked on sets of reference alignments. The quality of an alignment can then be represented by the sum-of-pairs (SP) or column (CS) score, which measure the agreement between a reference alignment and the corresponding query alignment. Both the SP and CS scores treat all mismatches between a query and a reference alignment as equally bad: they do not take into account how far apart, in the query alignment, two amino acids lie that should have been matched according to the reference alignment. This matters because the magnitude of alignment shifts is often relevant in biological analyses, including homology modeling and MSA refinement/manual alignment editing. In this study we develop a new alignment benchmark scoring scheme, SPdist, that takes the degree of discordance of mismatches into account by measuring the sequence distance between mismatched residue pairs in the query alignment. Using this new score along with the standard SP score, we investigate its discriminatory behavior by assessing how well six different MSA methods perform with respect to BAliBASE reference alignments. The SP score and the SPdist score yield very similar outcomes when the reference and query alignments are close. However, for more divergent reference alignments the SPdist score is able to distinguish between methods that keep alignments approximately close to the reference and those exhibiting larger shifts. We observed that by using SPdist together with SP scoring we were able to better delineate the difference in alignment quality between alternative MSA methods. With a case study we exemplify why it is important, from a biological perspective, to consider the separation of mismatches. The SPdist scoring scheme has been implemented in the VerAlign web server (http://www.ibi.vu.nl/programs/veralignwww/). The code for calculating the SPdist score is also available upon request.
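The contrast the abstract draws between SP scoring and distance-aware scoring can be illustrated with a minimal Python sketch. It is not the published SPdist formula, which the abstract does not state. As a labeled simplification, the sketch measures how far apart two residues sit in the query by their column offset, and converts that offset into a per-pair credit using a hypothetical `max_shift` cutoff. All function names and the `max_shift` parameter are assumptions made for illustration. Alignments are given as lists of equal-length gapped strings, and the query is assumed to contain the same sequences, in the same order, as the reference.

```python
from itertools import combinations

GAP = "-"


def residue_columns(row):
    """Column index of each residue (non-gap character) in a gapped row."""
    return [col for col, ch in enumerate(row) if ch != GAP]


def reference_pairs(ref):
    """Yield (a, b, i, j): residue i of sequence a shares a column with
    residue j of sequence b in the reference alignment."""
    for a, b in combinations(range(len(ref)), 2):
        i = j = 0
        for ch_a, ch_b in zip(ref[a], ref[b]):
            if ch_a != GAP and ch_b != GAP:
                yield a, b, i, j
            i += ch_a != GAP  # advance residue counters past non-gaps
            j += ch_b != GAP


def pair_score(ref, query, max_shift=None):
    """Average credit over all reference residue pairs.

    max_shift=None: plain SP-style scoring, credit 1 only if the pair is
    in the same query column.
    max_shift=k (hypothetical cutoff): credit decays linearly with the
    column offset in the query and reaches 0 at k columns or more.
    """
    cols = [residue_columns(row) for row in query]
    total, count = 0.0, 0
    for a, b, i, j in reference_pairs(ref):
        shift = abs(cols[a][i] - cols[b][j])
        if max_shift is None:
            total += 1.0 if shift == 0 else 0.0
        else:
            total += 1.0 - min(shift, max_shift) / max_shift
        count += 1
    return total / count if count else 0.0


ref = ["ACDEF", "ACDEF"]
near = ["ACDEF-", "-ACDEF"]      # every pair shifted by one column
far = ["ACDEF---", "---ACDEF"]   # every pair shifted by three columns

for name, query in [("near", near), ("far", far)]:
    print(name, pair_score(ref, query), pair_score(ref, query, max_shift=5))
```

In this toy case both query alignments receive an SP-style score of 0, because no reference pair is recovered exactly, whereas the distance-aware variant gives the one-column shift more credit than the three-column shift. This mirrors the behavior the abstract describes for divergent alignments, under the simplifying assumptions stated above.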