Torarinsson Elfar, Yao Zizhen, Wiklund Eric D, Bramsen Jesper B, Hansen Claus, Kjems Jørgen, Tommerup Niels, Ruzzo Walter L, Gorodkin Jan
Section for Genetics and Bioinformatics, IBVH, Faculty of Life Sciences, University of Copenhagen, 1870 Frederiksberg C, Denmark.
Genome Res. 2008 Feb;18(2):242-51. doi: 10.1101/gr.6887408. Epub 2007 Dec 20.
Recent computational scans for non-coding RNAs (ncRNAs) in multiple organisms have relied on existing multiple sequence alignments. However, as sequence similarity drops, a key signal of RNA structure (frequent compensating base changes) is increasingly likely to cause sequence-based alignment methods to misalign, or even refuse to align, homologous ncRNAs, consequently obscuring that structural signal. We have used CMfinder, a structure-oriented local alignment tool, to search the ENCODE regions of vertebrate multiple alignments. In agreement with other studies, we find a large number of potential RNA structures in the ENCODE regions. We report 6587 candidate regions with an estimated false-positive rate of 50%. More intriguingly, many of these candidates may be better represented by alignments taking the RNA secondary structure into account than those based on primary sequence alone, often quite dramatically. For example, approximately one-quarter of our predicted motifs show revisions in >50% of their aligned positions. Furthermore, our results are strongly complementary to those discovered by sequence-alignment-based approaches: 84% of our candidates are not covered by Washietl et al., increasing the number of ncRNA candidates in the ENCODE region by 32%. In a group of 11 ncRNA candidates that were tested by RT-PCR, 10 were confirmed to be present as RNA transcripts in human tissue, and most show evidence of significant differential expression across tissues. Our results broadly suggest caution in any analysis relying on multiple sequence alignments in less well-conserved regions, clearly support growing appreciation for the biological significance of ncRNAs, and strongly support the argument for considering RNA structure directly in any searches for these elements.
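The point about compensating base changes can be illustrated with a small sketch. It is a toy example only: the two sequences, the dot-bracket structure, and the helper names (`parse_pairs`, `identity`, `compensatory_changes`) are hypothetical and are not taken from the study or from CMfinder. It shows how two sequences can keep the same hairpin while sharing little primary sequence, which is the situation in which a purely sequence-based aligner may struggle.

```python
# Toy illustration with hypothetical sequences; not data or code from the study.

# Canonical Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def parse_pairs(structure):
    """Return the (i, j) index pairs encoded by a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def compensatory_changes(a, b, structure):
    """Count base pairs where both positions differ between a and b,
    yet each sequence can still form a valid pair at that position."""
    count = 0
    for i, j in parse_pairs(structure):
        both_changed = a[i] != b[i] and a[j] != b[j]
        still_paired = (a[i], a[j]) in PAIRS and (b[i], b[j]) in PAIRS
        if both_changed and still_paired:
            count += 1
    return count


# A shared 4-bp stem with a 4-nt loop (assumed structure for both sequences).
structure = "((((....))))"
seq_a = "GGCAGAAAUGCC"
seq_b = "CAGUGAAAACUG"

print(f"sequence identity: {identity(seq_a, seq_b):.2f}")
print(f"compensatory pairs: {compensatory_changes(seq_a, seq_b, structure)}")
```

In this toy case every stem pair has changed in both positions while remaining pairable, so the structure is fully preserved even though only the loop positions are identical between the two sequences.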