School of Computer Science, McGill University, Montréal, Canada.
LiX, École Polytechnique, Paris, France.
PLoS Comput Biol. 2021 May 28;17(5):e1008990. doi: 10.1371/journal.pcbi.1008990. eCollection 2021 May.
RNA tertiary structure is crucial to its many non-coding molecular functions. RNA architecture is shaped by its secondary structure composed of stems, stacked canonical base pairs, enclosing loops. While stems are precisely captured by free-energy models, loops composed of non-canonical base pairs are not. Nor are distant interactions linking together those secondary structure elements (SSEs). Databases of conserved 3D geometries (a.k.a. modules) not captured by energetic models are leveraged for structure prediction and design, but the computational complexity has limited their study to local elements, loops. Representing the RNA structure as a graph has recently allowed to expend this work to pairs of SSEs, uncovering a hierarchical organization of these 3D modules, at great computational cost. Systematically capturing recurrent patterns on a large scale is a main challenge in the study of RNA structures. In this paper, we present an efficient algorithm to compute maximal isomorphisms in edge colored graphs. We extend this algorithm to a framework well suited to identify RNA modules, and fast enough to considerably generalize previous approaches. To exhibit the versatility of our framework, we first reproduce results identifying all common modules spanning more than 2 SSEs, in a few hours instead of weeks. The efficiency of our new algorithm is demonstrated by computing the maximal modules between any pair of entire RNA in the non-redundant corpus of known RNA 3D structures. We observe that the biggest modules our method uncovers compose large shared sub-structure spanning hundreds of nucleotides and base pairs between the ribosomes of Thermus thermophilus, Escherichia Coli, and Pseudomonas aeruginosa.
RNA 的三级结构对其许多非编码分子功能至关重要。RNA 的结构由其二级结构形成,二级结构由茎、堆积的规范碱基对、封闭的环组成。虽然茎可以被自由能模型精确捕捉,但由非规范碱基对组成的环则不能。连接这些二级结构元件(SSEs)的远距离相互作用也不能。数据库中保守的 3D 几何形状(也称为模块)没有被能量模型捕获,用于结构预测和设计,但计算复杂性限制了它们对局部元素、环的研究。最近,将 RNA 结构表示为图的方法允许将这项工作扩展到 SSE 对,揭示了这些 3D 模块的层次组织,但计算成本很高。在大规模上系统地捕捉重复模式是 RNA 结构研究的主要挑战。在本文中,我们提出了一种有效算法来计算边着色图中的最大同构。我们将此算法扩展到一个非常适合识别 RNA 模块的框架中,并且足够快,可以极大地推广以前的方法。为了展示我们框架的多功能性,我们首先重现了识别跨越超过 2 个 SSE 的所有常见模块的结果,仅需几个小时而不是几周。我们的新算法的效率通过计算已知 RNA 3D 结构非冗余文库中任何两个完整 RNA 之间的最大模块来证明。我们观察到,我们的方法揭示的最大模块由跨越 Thermus thermophilus、Escherichia Coli 和 Pseudomonas aeruginosa 核糖体数百个核苷酸和碱基对的大共享子结构组成。