Stocsits Roman R, Hofacker Ivo L, Fried Claudia, Stadler Peter F
Interdisciplinary Centre for Bioinformatics, University of Leipzig, Haertelstrasse 16-18, D-04107 Leipzig, Germany.
BMC Bioinformatics. 2005 Jun 28;6:160. doi: 10.1186/1471-2105-6-160.
High-quality alignments of RNA and DNA sequences are an important prerequisite for the comparative analysis of genomic sequence data. Nucleic acid sequences, however, exhibit much greater sequence heterogeneity than the protein sequences they encode, owing to the redundancy of the genetic code. It is therefore desirable to make use of the amino acid sequence when aligning coding nucleic acid sequences. In many cases, however, only part of the sequence of interest is translated. Conversely, overlapping reading frames may encode multiple alternative proteins, possibly with interspersed non-coding parts. RNA virus genomes are a prominent example.
The standard scoring scheme for nucleic acid alignments can be extended to simultaneously incorporate information on translation products in one or more reading frames. Here we present a multiple alignment tool, codaln, that implements a combined nucleic acid plus amino acid scoring model for pairwise and progressive multiple alignments and allows arbitrary weighting of almost all scoring parameters. The resource requirements of codaln are comparable to those of standard tools such as ClustalW.
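The idea of a combined score can be illustrated with a minimal sketch. It is not codaln's implementation: the function names (`codon_at`, `combined_score`, `align_score`), the +1/-1 match and mismatch values, the linear gap penalty, the single weight `w_aa`, and the restriction to one reading frame shared by both sequences are all assumptions made for illustration. The sketch adds a weighted amino acid term to a plain nucleotide score whenever two positions sit at the same codon phase, and uses that score in a Needleman-Wunsch global alignment.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (uppercase DNA assumed).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_at(seq, pos, frame=0):
    """Amino acid of the codon covering `pos` in the given frame.

    Returns None if the codon is incomplete or contains unknown symbols.
    """
    start = pos - ((pos - frame) % 3)
    if start < 0 or start + 3 > len(seq):
        return None
    return CODON_TABLE.get(seq[start:start + 3])


def combined_score(a, i, b, j, w_aa=1.0, frame=0):
    """Hypothetical combined score for aligning a[i] with b[j]."""
    nt = 1.0 if a[i] == b[j] else -1.0  # illustrative nucleotide term
    # The amino acid term applies only when both positions share a codon phase.
    if (i - frame) % 3 != (j - frame) % 3:
        return nt
    aa_a, aa_b = codon_at(a, i, frame), codon_at(b, j, frame)
    if aa_a is None or aa_b is None:
        return nt
    return nt + w_aa * (1.0 if aa_a == aa_b else -1.0)


def align_score(a, b, w_aa=1.0, gap=-2.0, frame=0):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(
                prev[j - 1] + combined_score(a, i - 1, b, j - 1, w_aa, frame),
                prev[j] + gap,      # gap in b
                cur[j - 1] + gap,   # gap in a
            ))
        prev = cur
    return prev[-1]
```

With `w_aa=0` the sketch reduces to an ordinary nucleotide alignment. With a positive weight, a synonymous substitution such as CTT versus CTC (both leucine) is penalised less at the mismatching third position, because the amino acid term offsets the nucleotide mismatch.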
We demonstrate the applicability of codaln to various biologically relevant types of sequences (Levivirus bacteriophages and vertebrate Hox clusters) and show that combining nucleic acid and amino acid sequence information leads to improved alignments. These, in turn, increase the performance of analysis tools that depend strictly on good input alignments, such as methods for detecting conserved RNA secondary structure elements.