Wang Rui, Schlick Tamar
Simons Center for Computational Physical Chemistry, New York University, New York, New York, United States of America.
Department of Chemistry, New York University, New York, New York, United States of America.
PLoS Comput Biol. 2025 Jul 15;21(7):e1013230. doi: 10.1371/journal.pcbi.1013230. eCollection 2025 Jul.
Identifying novel and functional RNA structures remains a significant challenge in RNA motif design and is crucial for developing RNA-based therapeutics. Here we introduce a computational topology-based approach with unsupervised machine-learning algorithms to estimate the database size and content of RNA-like graph topologies. Specifically, we apply graph theory enumeration to generate all 110,667 possible 2D dual graphs for vertex numbers ranging from 2 to 9. Among them, only 0.11% (121 dual graphs) correspond to approximately 200,000 known RNA atomic fragments/substructures (collected in 2021) using the RNA-as-Graphs (RAG) framework. The remaining 99.89% of the dual graphs may be RNA-like or non-RNA-like. To determine which dual graphs in the 99.89% hypothetical set are more likely to be associated with RNA structures, we apply computational topology descriptors using the Persistent Spectral Graphs (PSG) method to characterize each graph using 19 PSG-based features and use clustering algorithms that partition all possible dual graphs into two clusters. The cluster with the higher percentage of known dual graphs for RNA is defined as the "RNA-like" cluster, while the other is considered "non-RNA-like". The distance between each dual graph and the center of the RNA-like cluster represents the likelihood of it belonging to RNA structures. In validation, our PSG-based RNA-like cluster includes 97.3% of the 121 known RNA dual graphs, suggesting good performance. Furthermore, 46.017% of the hypothetical RNAs are predicted to be RNA-like. Among the top 15 graphs identified as high-likelihood candidates for novel RNA motifs, 4 were confirmed from the RNA dataset collected in 2022. Notably, we observe that all the top 15 RNA-like dual graphs can be separated into multiple subgraphs, whereas the top 15 non-RNA-like dual graphs tend not to have any subgraphs (subgraphs preserve pseudoknots and junctions). Moreover, a significant topological difference between top RNA-like and non-RNA-like graphs is evident when comparing their topological features (e.g., Betti-0 and Betti-1 numbers). These findings provide valuable insights into the size of the RNA motif universe and RNA design strategies, offering a novel framework for predicting RNA graph topologies and guiding the discovery of novel RNA motifs, and possibly the design of anti-viral therapeutics by subgraph assembly.
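The cluster-labeling and distance-scoring step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which clustering algorithms were used, so plain k-means with two clusters is an assumption here, as are the toy 2D feature vectors (standing in for the 19 PSG-based features), the `known` flags, and the function names `kmeans_two_clusters` and `score_rna_like`, all of which are hypothetical.

```python
import math


def kmeans_two_clusters(points, iters=50):
    """Plain Lloyd's k-means with k=2 (an assumed stand-in for the
    unspecified clustering algorithms). Returns (centers, labels)."""
    # Deterministic start: the first point and the point farthest from it.
    first = points[0]
    centers = [first, max(points, key=lambda p: math.dist(p, first))]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (ties go to cluster 0).
        labels = [min((0, 1), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        new_centers = []
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:  # keep the old center if a cluster empties
                new_centers.append(centers[c])
                continue
            new_centers.append(
                tuple(sum(xs) / len(members) for xs in zip(*members)))
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


def score_rna_like(features, known):
    """Label the cluster with the higher fraction of known graphs as
    'RNA-like' and return each graph's distance to that cluster's center."""
    centers, labels = kmeans_two_clusters(features)
    fractions = []
    for c in (0, 1):
        flags = [k for k, lab in zip(known, labels) if lab == c]
        fractions.append(sum(flags) / len(flags) if flags else 0.0)
    rna_like = 0 if fractions[0] >= fractions[1] else 1
    distances = [math.dist(p, centers[rna_like]) for p in features]
    return rna_like, labels, distances


# Toy data (hypothetical): two well-separated groups of feature vectors.
features = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11)]
# True marks a graph that matches a known RNA structure.
known = [True, True, True, False, False, False, False, True]

rna_like, labels, distances = score_rna_like(features, known)
# Rank graphs from most to least RNA-like (smaller distance = more likely).
ranking = sorted(range(len(features)), key=lambda i: distances[i])
```

In this sketch a smaller distance to the RNA-like center is read as a higher likelihood, so `ranking` lists graph indices from the strongest to the weakest candidate. A real analysis would replace the toy vectors with the PSG-based features and might use a different clustering method.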