Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University, 535 West Michigan Street, Indianapolis, IN 46202, United States; Computers and Systems Department, National Telecommunication Institute, Cairo, Egypt.
Department of BioHealth Informatics, School of Informatics and Computing, Indiana University Purdue University, 535 West Michigan Street, Indianapolis, IN 46202, United States; Computer Science Department, University of Texas Rio Grande Valley, United States.
Methods. 2022 Jul;203:478-487. doi: 10.1016/j.ymeth.2022.02.005. Epub 2022 Feb 16.
Pseudouridine is one of the most abundant RNA modifications, formed when pseudouridine synthase proteins catalyze the isomerization of uridine. It plays an important role in many biological processes and has been reported to have applications in drug development. Recently, single-molecule sequencing techniques such as the direct RNA sequencing platform offered by Oxford Nanopore Technologies have enabled direct detection of RNA modifications on the molecule being sequenced. In this study, we introduce a tool called Penguin that integrates several machine learning (ML) models to identify RNA pseudouridine sites on Nanopore direct RNA sequencing reads. Pseudouridine sites were identified on single-molecule sequencing data collected from direct RNA sequencing, which yielded 723 K reads in the Hek293 cell line and 500 K reads in the Hela cell line. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore sequencer and from the corresponding basecalled k-mer. These features are used to train the predictors included in Penguin, which can then predict, in the testing phase, whether the signal is modified by the presence of pseudouridine sites. Penguin includes several predictors: support vector machines (SVM), random forest (RF), and neural networks (NN). The results on the two benchmark datasets for the Hek293 and Hela cell lines show strong performance of Penguin in both random split testing and independent validation testing. In random split testing, Penguin identifies pseudouridine sites with an accuracy of 93.38% when SVM is applied to the Hek293 benchmark dataset. In independent validation testing, Penguin achieves an accuracy of 92.61% when SVM is trained on the Hek293 benchmark dataset and tested on the Hela benchmark dataset. Thus, in independent validation testing, Penguin achieves 16% higher accuracy than the existing pseudouridine predictors in the literature.
Employing Penguin to predict pseudouridine sites revealed a significant enrichment of "regulation of mRNA 3'-end processing" in the Hek293 cell line and "positive regulation of transcription from RNA polymerase II promoter involved in cellular response to chemical stimulus" in the Hela cell line. Penguin software and models are available on GitHub at https://github.com/Janga-Lab/Penguin and can be readily employed for predicting Ψ sites from Nanopore direct RNA sequencing datasets.
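The feature-extraction and evaluation steps described above can be illustrated with a minimal sketch. It is not Penguin's actual code: the abstract does not list the features, so the choice of summary statistics (mean, median, standard deviation, sample count), the one-hot k-mer encoding, and all function names here are assumptions made purely for illustration.

```python
import random
import statistics

BASES = "ACGU"  # assumption: RNA alphabet; a basecalled "T" is treated as "U"


def one_hot_kmer(kmer):
    """Encode a k-mer as a flat one-hot vector (4 values per base)."""
    vec = []
    for base in kmer.upper().replace("T", "U"):
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def signal_features(samples):
    """Hypothetical summary statistics of the raw current samples of one event."""
    return [
        statistics.fmean(samples),   # mean current
        statistics.median(samples),  # median current
        statistics.pstdev(samples),  # population standard deviation
        float(len(samples)),         # number of samples in the event
    ]


def build_feature_vector(kmer, samples):
    """Concatenate k-mer and signal features into one row for a classifier."""
    return one_hot_kmer(kmer) + signal_features(samples)


def random_split(rows, test_fraction=0.2, seed=0):
    """Random split testing: shuffle one dataset and hold out a fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Under these assumptions, `build_feature_vector("GAUCA", [80.1, 82.3, 79.8, 81.0])` yields a 24-element row (20 one-hot values plus 4 signal statistics). Random split testing would train and test on the two halves returned by `random_split` for a single cell line, whereas independent validation testing would train on all rows from one cell line and score `accuracy` on rows from the other.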