Zhang Xinru, Wang Shutao, Xie Lina, Zhu Yuhui
Department of Pharmacy, The Second Hospital of Jilin University, Changchun, China.
Front Genet. 2023 Jan 19;14:1121694. doi: 10.3389/fgene.2023.1121694. eCollection 2023.
Pseudouridine (Ψ) is one of the most abundant RNA modifications, found in a variety of RNA types, and it plays a significant role in many biological processes. Identifying Ψ sites is the key to studying the biochemical functions and mechanisms of Ψ. However, identifying Ψ sites experimentally is time-consuming and expensive, so computational methods that can accurately predict Ψ sites from RNA sequence information are needed. In this study, we propose a new model called PseU-ST to identify Ψ sites in Homo sapiens, Saccharomyces cerevisiae, and Mus musculus. We selected the best six encoding schemes and four machine learning algorithms based on a comprehensive test of almost all of the RNA sequence encoding schemes available in the iLearnPlus software package, and selected the optimal features for each encoding scheme using chi-square and incremental feature selection algorithms. We then selected the optimal feature combination and the best base-classifier combination for each species through an extensive performance comparison, and employed a stacking strategy to build the predictive model. PseU-ST achieved better prediction performance than other existing models. Its accuracy was 93.64%, 87.74%, and 89.64% on the H_990, S_628, and M_944 datasets, respectively, exceeding the best existing methods on the same benchmark training datasets by 13.94%, 6.05%, and 0.26%, respectively. These results indicate that PseU-ST is a very competitive prediction model for identifying RNA Ψ sites in H. sapiens, S. cerevisiae, and M. musculus. In addition, we found that the Position-specific trinucleotide propensity based on single strand (PSTNPss) and Position-specific of three nucleotides (PS3) features play an important role in Ψ site identification. The source code for PseU-ST and the data are available in our GitHub repository (https://github.com/jluzhangxinrubio/PseU-ST).
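The workflow described above (chi-square ranking, incremental feature selection, then a stacked ensemble) can be illustrated with a minimal sketch. It is not the PseU-ST implementation: it assumes scikit-learn rather than iLearnPlus, uses synthetic data in place of encoded RNA sequences, and the base learners, step size, and function names are hypothetical placeholders rather than the algorithms and settings chosen in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC


def rank_features_chi2(X, y):
    """Return feature indices sorted from highest to lowest chi-square score."""
    scores, _ = chi2(X, y)          # chi2 requires non-negative features
    scores = np.nan_to_num(scores)  # constant features can produce NaN
    return np.argsort(scores)[::-1]


def incremental_feature_selection(X, y, ranking, step=5, cv=5):
    """Add ranked features in blocks of `step`; keep the best-scoring prefix."""
    best_k, best_acc = 0, -1.0
    for k in range(step, len(ranking) + 1, step):
        clf = LogisticRegression(max_iter=1000)
        acc = cross_val_score(clf, X[:, ranking[:k]], y, cv=cv).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranking[:best_k], best_acc


def build_stacking_model():
    """Stack two placeholder base learners under a logistic-regression meta-learner."""
    base_learners = [
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(random_state=0)),
    ]
    return StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )


# Synthetic stand-in for an encoded sequence dataset (assumption, not real data).
X_raw, y = make_classification(
    n_samples=200, n_features=40, n_informative=8, random_state=0
)
X = MinMaxScaler().fit_transform(X_raw)  # scale to [0, 1] so chi2 is applicable

ranking = rank_features_chi2(X, y)
selected, ifs_acc = incremental_feature_selection(X, y, ranking)

# Note: selecting features on the full dataset before cross-validation is
# optimistic; a careful evaluation would nest the selection inside each fold.
stack_acc = cross_val_score(build_stacking_model(), X[:, selected], y, cv=5).mean()
print(f"selected {len(selected)} features, stacked CV accuracy = {stack_acc:.3f}")
```

Separating ranking, prefix search, and stacking into three functions mirrors the order of the steps in the abstract and makes it easy to swap in a different encoding or base-classifier set per species.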