Kim Namhee, Shiffeldrim Nahum, Gan Hin Hark, Schlick Tamar
Department of Chemistry, New York University, 100 Washington Square East, Room 1001, New York, NY 10003, USA.
J Mol Biol. 2004 Aug 27;341(5):1129-44. doi: 10.1016/j.jmb.2004.06.054.
Because the functional repertoire of RNA molecules, like that of proteins, is closely linked to the diversity of their shapes, uncovering RNA's structural repertoire is vital for identifying novel RNAs, especially in genomic sequences. To help expand the limited number of known RNA families, we use graphical representation and clustering analysis of RNA secondary structures to predict novel RNA topologies and their abundance as a function of size. Representing the essential topological properties of RNA secondary structures as graphs enables enumeration, generation, and prediction of novel RNA motifs. We apply a probabilistic graph-growing method to construct the RNA structure space encompassing the topologies of existing and hypothetical RNAs, and we cluster all RNA topologies into two groups using topological descriptors and a standard clustering algorithm. Significantly, we find that nearly all existing RNAs fall into one group, which we refer to as "RNA-like"; we consider the other group "non-RNA-like". Our method predicts many candidates for novel RNA secondary topologies, some of which are remarkably similar to existing structures; interestingly, the centroid of the RNA-like group is the tmRNA fold, a pseudoknot having both tRNA-like and mRNA-like functions. Additionally, our approach allows estimation of the relative abundance of pseudoknot and other (e.g., tree) motifs using the "edge-cut" property of RNA graphs. This analysis suggests that pseudoknots dominate the RNA structure universe, representing more than 90% of topologies when the sequence length exceeds 120 nt; the predicted trend for sequences shorter than 100 nt agrees with data for existing RNAs. Together with our predictions for novel "RNA-like" topologies, our analysis can help direct the design of functional RNAs and the identification of novel RNA folds in genomes through an efficient topology-directed search, whose complexity grows much more slowly with RNA size than that of the traditional sequence-based search.
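The "edge-cut" idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract names the property but does not define it, so everything below is an assumption made for illustration: an RNA topology is taken to be a small undirected multigraph given as an edge list, an edge counts as a "cut edge" if removing it disconnects the graph, and the function names and class labels (`tree`, `no-cut-edge`, `mixed`) are hypothetical. How these classes correspond to the tree and pseudoknot motifs of the paper is not specified here.

```python
from collections import defaultdict


def is_connected(n_vertices, edges):
    """Return True if the undirected multigraph on vertices 0..n-1 is connected."""
    if n_vertices == 0:
        return True
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen = {0}
    stack = [0]
    while stack:  # iterative depth-first search from vertex 0
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == n_vertices


def cut_edges(n_vertices, edges):
    """Return indices of edges whose removal disconnects the graph.

    Brute force (remove each edge, re-test connectivity); adequate for the
    small graphs used to illustrate the idea.
    """
    return [
        i
        for i in range(len(edges))
        if not is_connected(n_vertices, edges[:i] + edges[i + 1:])
    ]


def classify(n_vertices, edges):
    """Assign a hypothetical label based on how many edges are cut edges."""
    bridges = cut_edges(n_vertices, edges)
    if len(bridges) == len(edges):
        return "tree"         # every edge is a cut edge (acyclic graph)
    if not bridges:
        return "no-cut-edge"  # no single edge removal disconnects the graph
    return "mixed"            # some, but not all, edges are cut edges


if __name__ == "__main__":
    print(classify(3, [(0, 1), (1, 2)]))                  # a simple path
    print(classify(3, [(0, 1), (1, 2), (2, 0)]))          # a triangle
    print(classify(4, [(0, 1), (1, 2), (2, 0), (2, 3)]))  # triangle plus a tail
```

Counting how many enumerated graphs of a given size fall into each class would give a rough abundance estimate of the kind the abstract describes, although the enumeration procedure itself is not sketched here.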