Sebastian Will, Kristin Reiche, Ivo L. Hofacker, Peter F. Stadler, Rolf Backofen
Bioinformatics Group, Institute of Computer Science, University of Freiburg, Freiburg, Germany.
PLoS Comput Biol. 2007 Apr 13;3(4):e65. doi: 10.1371/journal.pcbi.0030065. Epub 2007 Feb 22.
The RFAM database defines families of ncRNAs by means of sequence similarities that are sufficient to establish homology. In some cases, such as microRNAs and box H/ACA snoRNAs, functional commonalities define classes of RNAs that are characterized by structural similarities, and typically consist of multiple RNA families. Recent advances in high-throughput transcriptomics and comparative genomics have produced very large sets of putative noncoding RNAs and regulatory RNA signals. For many of them, evidence for stabilizing selection acting on their secondary structures has been derived, and at least approximate models of their structures have been computed. The overwhelming majority of these hypothetical RNAs cannot be assigned to established families or classes. We present here a structure-based clustering approach that is capable of extracting putative RNA classes from genome-wide surveys for structured RNAs. The LocARNA (local alignment of RNA) tool implements a novel variant of the Sankoff algorithm that is sufficiently fast to deal with several thousand candidate sequences. The method is also robust against false positive predictions, i.e., a contamination of the input data with unstructured or nonconserved sequences. We have successfully tested the LocARNA-based clustering approach on the sequences of the RFAM-seed alignments. Furthermore, we have applied it to a previously published set of 3,332 predicted structured elements in the Ciona intestinalis genome (Missal K, Rose D, Stadler PF (2005) Noncoding RNAs in Ciona intestinalis. Bioinformatics 21 (Supplement 2): i77-i78). In addition to recovering, e.g., tRNAs as a structure-based class, the method identifies several RNA families, including microRNA and snoRNA candidates, and suggests several novel classes of ncRNAs for which to date no representative has been experimentally characterized.
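To make the clustering idea in the abstract concrete, the following is a minimal sketch of grouping sequences from pairwise structural-alignment scores. It is not the authors' pipeline: the abstract does not state the linkage rule or how clusters are cut, so average linkage and a fixed distance cutoff are assumptions here. The scores in `toy_scores` are invented for illustration and are not LocARNA output, and the names `average_linkage_clusters` and `scores_to_distances` are hypothetical.

```python
from itertools import combinations


def average_linkage_clusters(names, dist, cutoff):
    """Agglomerative clustering with average linkage, stopped at `cutoff`.

    names  : list of sequence identifiers
    dist   : dict mapping frozenset({a, b}) -> distance between a and b
    cutoff : stop merging once the closest pair of clusters is farther apart
    """
    clusters = [(name,) for name in names]

    def cluster_distance(c1, c2):
        # Average of all pairwise distances between members of c1 and c2.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the two closest clusters under average linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        if cluster_distance(clusters[i], clusters[j]) > cutoff:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(tuple(sorted(c)) for c in clusters)


def scores_to_distances(scores):
    """Turn pairwise similarity scores into distances (best score -> 0)."""
    top = max(scores.values())
    return {pair: top - s for pair, s in scores.items()}


# Invented pairwise scores (higher = more similar); NOT real LocARNA output.
toy_scores = {
    frozenset(("tRNA_a", "tRNA_b")): 90,
    frozenset(("tRNA_a", "mir_a")): 20,
    frozenset(("tRNA_a", "mir_b")): 15,
    frozenset(("tRNA_a", "noise")): 5,
    frozenset(("tRNA_b", "mir_a")): 18,
    frozenset(("tRNA_b", "mir_b")): 22,
    frozenset(("tRNA_b", "noise")): 8,
    frozenset(("mir_a", "mir_b")): 80,
    frozenset(("mir_a", "noise")): 10,
    frozenset(("mir_b", "noise")): 12,
}
names = ["tRNA_a", "tRNA_b", "mir_a", "mir_b", "noise"]
clusters = average_linkage_clusters(
    names, scores_to_distances(toy_scores), cutoff=30
)
print(clusters)
```

In this toy input the two tRNA-like and the two microRNA-like entries merge into separate clusters, while the low-scoring `noise` entry stays a singleton. That is a small-scale picture of why a contaminating unstructured sequence need not disturb the well-supported clusters.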