Wilm Andreas, Mainz Indra, Steger Gerhard
Institut für Physikalische Biologie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany.
Algorithms Mol Biol. 2006 Oct 24;1:19. doi: 10.1186/1748-7188-1-19.
The performance of alignment programs is traditionally tested on sets of protein sequences for which a reference alignment is known. Conclusions drawn from such protein benchmarks do not necessarily hold for the RNA alignment problem, as the first published RNA alignment benchmark demonstrated. For example, the twilight zone (the similarity range where alignment quality drops drastically) starts at 60 % for RNAs, compared with 20 % for proteins. In this study we enhance the previous benchmark.
The RNA sequence sets in the benchmark database are taken from an increased number of RNA families, so that results are not biased by reliance on only a few families. Set sizes vary from 2 to 15 sequences to assess the influence of the number of sequences on program performance. Alignment quality is scored by two measures: one takes into account only nucleotide matches, the other measures structural conservation. The performance order of parameters (such as nucleotide substitution matrices and gap costs) and of programs is rated by rank tests.
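The match-based measure can be illustrated with a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that a test alignment reproduces. This is a minimal sketch under stated assumptions. The abstract does not name the exact measure, so the function names `aligned_pairs` and `sum_of_pairs_score`, the `-` gap character, and the scoring convention are illustrative choices, not the benchmark's own implementation. The structural measure is not sketched here.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each pair is ((seq_index, residue_index), (seq_index, residue_index)),
    where residue_index counts non-gap characters within that sequence.
    Assumes all rows of the alignment have equal length.
    """
    pairs = set()
    counters = [0] * len(alignment)  # next residue index per sequence
    for column in zip(*alignment):
        present = []
        for seq_index, char in enumerate(column):
            if char != gap:
                present.append((seq_index, counters[seq_index]))
                counters[seq_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test, reference, gap="-"):
    """Fraction of reference residue pairs also aligned in the test (0..1)."""
    reference_pairs = aligned_pairs(reference, gap)
    if not reference_pairs:
        return 0.0
    shared = reference_pairs & aligned_pairs(test, gap)
    return len(shared) / len(reference_pairs)


if __name__ == "__main__":
    reference = ["ACGU", "A-GU"]
    test = ["ACGU", "AG-U"]  # misplaces the gap by one column
    print(sum_of_pairs_score(test, reference))
```

In this toy case the reference aligns three residue pairs and the test alignment recovers two of them, so a single misplaced gap already costs a third of the score.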
Most sequence alignment programs perform equally well on RNA sequence sets with high sequence identity, that is, with an average pairwise sequence identity (APSI) above 75 %. Parameters for gap-open and gap-extension have a large influence on alignment quality at APSI ≤ 75 %; optimal parameter combinations are shown for several programs. The use of different 4 × 4 substitution matrices improved program performance only in some cases. The performance of iterative programs increases drastically with increasing sequence numbers and/or decreasing sequence identity, which makes them clearly superior to programs using a purely non-iterative, progressive approach. The best sequence alignment programs produce alignments of high quality for APSI > 55 %; at lower APSI the use of sequence+structure alignment programs is recommended.
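Since the thresholds above are stated in terms of APSI, a short sketch of how it could be computed from an existing alignment may help. The abstract does not specify how pairwise identity is normalized. The version below assumes identity is the number of identical columns divided by the number of columns in which at least one of the two sequences has a residue. The names `pairwise_identity` and `apsi` and the gap characters `-` and `.` are illustrative.

```python
from itertools import combinations

GAP_CHARS = "-."  # assumed gap symbols


def pairwise_identity(a: str, b: str) -> float:
    """Identity of two aligned rows, ignoring columns where both are gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS and y in GAP_CHARS:
            continue  # column carries no information for this pair
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def apsi(alignment: list[str]) -> float:
    """Average pairwise sequence identity of an alignment, in percent."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return 100.0 * sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    alignment = ["ACGU", "ACGU", "ACGA"]
    print(round(apsi(alignment), 2))
```

Other normalizations (for example dividing by the length of the shorter sequence) give somewhat different values, so the convention in use should be checked before comparing a set against the 75 % and 55 % thresholds.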