Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan.
Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon, 16419, Republic of Korea.
Comput Biol Med. 2024 Feb;169:107848. doi: 10.1016/j.compbiomed.2023.107848. Epub 2023 Dec 13.
Dihydrouridine (DHU, D) is one of the most abundant post-transcriptional uridine modifications. It is found in tRNA, mRNA, and snoRNA and is closely associated with disease pathogenesis and various biological processes in eukaryotes. Identifying D sites is important for understanding the modification mechanisms and/or epigenetic regulation. However, biological experiments for detecting D sites are time-consuming and expensive. Computational methods have therefore been developed to identify D sites in genome-wide datasets, but the existing methods have limitations and their prediction performance needs to be improved. In this work, we developed Stack-DHUpred, a new computational predictor for accurately identifying D sites. Briefly, we trained 66 baseline (single-feature) models by pairing each of six machine learning classifiers with each of eleven feature encoding methods (6 × 11 = 66), and stacked different baseline models to build stacked ensemble learning models. Subsequently, the optimal combination of baseline models was identified for the construction of the final stacked model. Stack-DHUpred outperformed the existing predictors on our new independent dataset, indicating that the stacking approach significantly improved the prediction performance. Stack-DHUpred is publicly available as a web server (http://kurata35.bio.kyutech.ac.jp/Stack-DHUpred) and as a standalone program (https://github.com/kuratahiroyuki/Stack-DHUpred). We believe that Stack-DHUpred will be a valuable tool for accelerating the discovery of D modifications and understanding their role in post-transcriptional regulation.
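The stacking scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives only the counts (six classifiers, eleven encodings), so the names `classifier_1` ... `encoding_11` are hypothetical placeholders. The logistic-regression meta-learner, the choice of three baseline models, and the toy probabilities are likewise assumptions made purely for illustration. The sketch shows how 66 baseline models arise from the classifier-encoding grid and how the predicted probabilities of a selected subset become the input features of a second-level model.

```python
import math
from itertools import product

# Hypothetical placeholder names: the abstract states only the counts.
CLASSIFIERS = [f"classifier_{i}" for i in range(1, 7)]   # six classifiers
ENCODINGS = [f"encoding_{j}" for j in range(1, 12)]      # eleven encodings
BASELINES = list(product(CLASSIFIERS, ENCODINGS))        # one model per pair


def meta_features(per_model_probs, order):
    """Build one row per sample from each selected model's predicted probability."""
    return [list(row) for row in zip(*(per_model_probs[m] for m in order))]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta(rows, labels, lr=0.5, epochs=500):
    """Assumed meta-learner: logistic regression fitted by stochastic gradient descent."""
    w, b = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict_meta(model, row):
    """Probability that a candidate site is a D site, according to the stacked model."""
    w, b = model
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, row)))


# Toy values standing in for held-out predictions of three selected baseline
# models on four candidate sites (1 = D site, 0 = non-D site). Invented numbers.
selected = BASELINES[:3]
toy_probs = {
    selected[0]: [0.9, 0.7, 0.2, 0.4],
    selected[1]: [0.8, 0.6, 0.3, 0.1],
    selected[2]: [0.6, 0.4, 0.5, 0.5],
}
labels = [1, 1, 0, 0]

rows = meta_features(toy_probs, selected)
stacked_model = fit_meta(rows, labels)
scores = [predict_meta(stacked_model, r) for r in rows]
```

In practice the meta-features would come from cross-validated predictions of trained baseline models, so that the second-level model is not fitted on probabilities the baselines produced for their own training samples. Searching over subsets of `BASELINES`, rather than fixing the first three as done here, corresponds to the step of identifying the optimal combination of baseline models.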