School of Computer Science and Technology, Xidian University, Xi'an 710071, China.
Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
Int J Mol Sci. 2022 Mar 11;23(6):3044. doi: 10.3390/ijms23063044.
Dihydrouridine (D) is an abundant post-transcriptional modification present in transfer RNA from eukaryotes, bacteria, and archaea. D has contributed to treatments for cancer, so the precise detection of D modification sites can further the understanding of its functional roles. Traditional experimental techniques for identifying D are laborious and time-consuming, and few computational tools exist for this analysis. In this study, we used eleven sequence-derived feature extraction methods and implemented five popular machine learning algorithms to identify an optimal model. During data preprocessing, the data were partitioned into training and testing sets, and oversampling was adopted to reduce the effect of the imbalance between positive and negative samples. The best-performing model was obtained by combining random forest with nucleotide chemical property modeling. In independent tests, the optimized model achieved a sensitivity of 0.9688 and a specificity of 0.9706, surpassing published tools. Furthermore, a series of validations across several aspects was conducted to demonstrate the robustness and reliability of our model.
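To make the feature encoding and the reported metrics concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the commonly used nucleotide chemical property (NCP) scheme, in which each base is described by three binary properties (ring structure, functional group, hydrogen bonding); the abstract does not spell out the mapping, so the table below is an assumption. The function names `ncp_encode` and `sensitivity_specificity` are hypothetical.

```python
# Assumed NCP mapping: (ring structure, functional group, hydrogen bonding).
# purine = 1 / pyrimidine = 0; amino = 1 / keto = 0; weak H-bond = 1 / strong = 0.
# This table is an assumption; the abstract does not state the exact encoding.
NCP = {
    "A": (1, 1, 1),
    "C": (0, 1, 0),
    "G": (1, 0, 0),
    "U": (0, 0, 1),
}


def ncp_encode(seq):
    """Encode an RNA sequence as a flat list of 3 binary values per base."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    features = []
    for base in seq:
        # Unknown symbols (e.g. N) are encoded as all zeros in this sketch.
        features.extend(NCP.get(base, (0, 0, 0)))
    return features


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = D site."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    return sn, sp
```

The resulting fixed-length vectors could be passed to any random forest implementation; sensitivity is the fraction of true D sites recovered and specificity is the fraction of non-D sites correctly rejected.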