School of Automotive and Transportation Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China.
Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China.
RNA Biol. 2021 Dec;18(12):2236-2246. doi: 10.1080/15476286.2021.1898160. Epub 2021 Mar 17.
As one of the common post-transcriptional modifications in tRNAs, dihydrouridine (D) plays a prominent role in regulating tRNA flexibility and has been linked to cancer. Because the sequencing techniques used to detect D modification are expensive and time-consuming, precise computational tools can greatly advance the study of molecular mechanisms and medical development. We propose a novel predictor, iRNAD_XGBoost, that identifies potential D sites using multiple RNA sequence representations. The method addresses class imbalance with the hybrid sampling method SMOTEENN and builds the model on the top 30 features selected by XGBoost. The optimized model achieved high values of 97.13% and 97.38% on its two evaluation metrics in the jackknife test. In the independent test, the same two metrics reached 91.67% and 94.74%, respectively. Compared with the iRNAD method, this model showed high generalizability and consistent prediction performance on positive and negative samples, yielding satisfactory scores of 0.94 and 0.86, respectively. The results suggest that chemical property and nucleotide density features (CPND), electron-ion interaction pseudopotential features (EIIP and PseEIIP), and dinucleotide composition (DNC) are crucial to recognizing D modification. The proposed predictor is a promising tool to help experimental biologists investigate molecular functions.
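To make the sequence representations named above concrete, the following is a minimal sketch of how DNC, CPND, and EIIP encodings are commonly computed for an RNA window. It is not the authors' implementation: the function names, the chemical-property triplets, the reuse of the thymine EIIP value for uracil, and the concatenation order are all assumptions for illustration, and PseEIIP is omitted.

```python
from itertools import product

# Assumed chemical-property triplets (ring structure, functional group,
# hydrogen bonding), as commonly used in RNA modification predictors.
CHEM = {"A": (1, 1, 1), "C": (0, 1, 0), "G": (1, 0, 0), "U": (0, 0, 1)}

# EIIP values per nucleotide; U reuses the value usually given for T (assumption).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}

# The 16 dinucleotides in a fixed order: AA, AC, AG, AU, CA, ...
PAIRS = ["".join(p) for p in product("ACGU", repeat=2)]


def dnc(seq):
    """Dinucleotide composition: frequency of each of the 16 dinucleotides."""
    total = len(seq) - 1
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(total):
        counts[seq[i:i + 2]] += 1
    return [counts[p] / total for p in PAIRS]


def cpnd(seq):
    """Per position: 3 chemical-property bits plus the running density of
    that nucleotide (its count in the prefix divided by the prefix length)."""
    features, seen = [], {}
    for position, nt in enumerate(seq, start=1):
        seen[nt] = seen.get(nt, 0) + 1
        features.extend(CHEM[nt])
        features.append(seen[nt] / position)
    return features


def eiip(seq):
    """Per position: the EIIP value of the nucleotide."""
    return [EIIP[nt] for nt in seq]


def encode(seq):
    """Concatenate DNC, CPND, and EIIP features for one sequence window."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input
    return dnc(seq) + cpnd(seq) + eiip(seq)
```

For a window of length n this yields 16 + 4n + n values; a vector like this would then be passed to resampling and feature selection. Sequences containing characters other than A, C, G, U (or T) raise a `KeyError` in this sketch.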