Cohen Shani, Rokach Lior, Veksler-Lublinsky Isana
Department of Software & Information Systems Engineering, Faculty of Engineering, Ben-Gurion University of the Negev, 8410501, Beer-Sheva, Israel.
BMC Bioinformatics. 2025 May 21;26(1):131. doi: 10.1186/s12859-025-06153-w.
Bacterial small RNAs (sRNAs) are pivotal in post-transcriptional regulation, affecting functions such as virulence, metabolism, and gene expression by binding specific mRNA targets. Identifying these targets is crucial to understanding sRNA regulation across species. Despite recent advances, high-throughput (HT) experimental methods remain technically challenging and detect sRNA-target interactions only under specific environmental conditions. Therefore, computational approaches, especially machine learning (ML), are essential for identifying strong candidates for biological validation. In this paper, we hypothesize that ML models trained on large-scale interaction data from specific conditions can accurately predict new interactions in unseen conditions within the same bacterial strain. To test this, we developed models from two families: (1) graph neural networks (GNNs), including GraphRNA and kGraphRNA, that learn transformed representations of interacting sRNA-mRNA pairs via graph relationships, and (2) decision forests, sInterRF (Random Forest) and sInterXGB (XGBoost), which use various interaction features for prediction. We also proposed Summation Ensemble Models (SEM) that combine scores from multiple models. Across three seen-to-unseen condition evaluations, our models, particularly kGraphRNA, significantly improved the area under the ROC curve (AUC) and the area under the precision-recall curve (PR-AUC) compared to sRNARFTarget, CopraRNA, and RNAup. The SEM model combining GraphRNA and CopraRNA outperformed CopraRNA alone on a low-throughput (LT) interactions test set (HT-to-LT evaluation). Beyond enhanced performance, our models enable target prediction for species-specific sRNAs, a capability lacking in some existing tools. Furthermore, GNN models remove the dependency on external tools like RNAplex or RNAup to compute hybridization duplex or energy features, enhancing scalability and runtime efficiency. While this study focuses on E. coli K12 MG1655 interactions, our methods are fully adaptable to predict interactions in other bacterial strains, given sufficient data for training. Our comprehensive feature importance analysis revealed the complexity of sRNA-mRNA interactions across environmental conditions, underscoring the significance of RNA sequence composition and duplex structure characteristics, such as base pairing and energy factors. These findings align with biological evidence from previous studies. As HT experiments expand sRNA-target interaction data across conditions in various bacteria, our ML methods, together with feature analysis, offer promising advances in sRNA-target prediction and deeper insights into sRNA regulatory mechanisms across diverse species.
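The abstract describes Summation Ensemble Models (SEM) only as combining scores from multiple models, and reports results as AUC. The sketch below illustrates one way such a combination and its evaluation could look; it is not the authors' implementation. The min-max rescaling step, the function names (`min_max`, `summation_ensemble`, `roc_auc`), and the assumption that a higher score always means a more likely interaction are illustrative choices not stated in the abstract. Scores where lower is better (for example p-values or hybridization energies) would need to be negated first.

```python
from typing import Dict, List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so models on different scales are comparable.

    Assumption: rescaling before summing is an illustrative choice; the
    abstract does not say whether or how scores are normalized.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def summation_ensemble(model_scores: Dict[str, Sequence[float]]) -> List[float]:
    """Sum the rescaled scores of several models for each candidate pair.

    model_scores maps a (hypothetical) model name to one score per
    sRNA-mRNA candidate pair, all listed in the same pair order.
    """
    if not model_scores:
        raise ValueError("at least one model is required")
    columns = [min_max(list(s)) for s in model_scores.values()]
    n_pairs = len(columns[0])
    if any(len(c) != n_pairs for c in columns):
        raise ValueError("all models must score the same candidate pairs")
    return [sum(per_pair) for per_pair in zip(*columns)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative, counting ties as one half (quadratic time, fine for a demo)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: two hypothetical models scoring four candidate pairs.
toy_scores = {
    "model_a": [0.9, 0.2, 0.6, 0.1],
    "model_b": [10.0, 2.0, 3.0, 4.0],
}
toy_labels = [1, 0, 1, 0]  # 1 = interacting pair, 0 = non-interacting pair
combined = summation_ensemble(toy_scores)
toy_auc = roc_auc(toy_labels, combined)
```

Summing rescaled scores keeps the ensemble free of trainable parameters, so it can pair a learned model with an existing tool without any additional training data.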