Ahola Virpi, Aittokallio Tero, Vihinen Mauno, Uusipaikka Esa
Biotechnology and Food Research, MTT Agrifood Research Finland, Jokioinen, Finland.
BMC Bioinformatics. 2006 Nov 3;7:484. doi: 10.1186/1471-2105-7-484.
Multiple sequence alignment is the foundation of many important applications in bioinformatics, including detecting functionally important regions, predicting protein structures, and building phylogenetic trees. Although the automatic construction of a multiple sequence alignment for a set of remotely related sequences is a very challenging and error-prone task, many downstream analyses still rely heavily on the accuracy of the alignments.
To address the need for an objective evaluation framework, we introduce a statistical score that assesses the quality of a given multiple sequence alignment. The quality assessment is based on counting the number of significantly conserved positions in the alignment, using an importance sampling method in conjunction with a statistical profile analysis framework. We first evaluate the novel objective function that the alignment quality score uses to measure positional conservation. The results for the Src homology 2 (SH2) domain, Ras-like proteins, peptidase M13, subtilase and beta-lactamase families demonstrate that the score can distinguish sequence patterns with different degrees of conservation. Secondly, we evaluate the quality of the alignments produced by several widely used multiple sequence alignment programs using both the novel alignment quality score and the commonly used sum of pairs method. According to these results, the Mafft strategy L-INS-i outperforms the other methods, although the differences among Probcons, TCoffee and Muscle are mostly insignificant. The novel alignment quality score gives results similar to those of the sum of pairs method.
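The abstract does not define the sum of pairs method it uses for comparison. As an illustration only, the sketch below assumes one common reference-based form: the fraction of residue pairs aligned in a trusted reference alignment that are also aligned in the alignment under test. The function names `aligned_pairs` and `sum_of_pairs_score` and the toy sequences are hypothetical, and this is not necessarily the exact variant used in the study.

```python
from itertools import combinations

GAP = "-"  # assumed gap character


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, residue index within
    the ungapped sequence), so the pairs are comparable across different
    alignments of the same sequences.
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")

    counters = [0] * len(alignment)  # residues seen so far in each sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != GAP:
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        # Every pair of residues in this column counts as aligned.
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference):
    """Fraction of reference-aligned residue pairs reproduced in `test`."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy example: the same two sequences aligned in two different ways.
reference = ["AC-GT", "ACAGT"]
test = ["ACG-T", "ACAGT"]
print(sum_of_pairs_score(test, reference))  # 0.75
```

A score of this kind needs a reference alignment, whereas the statistical score described in the abstract is computed from the given alignment itself by counting significantly conserved positions.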
The results indicate that the proposed statistical score is useful in assessing the quality of multiple sequence alignments.