Mu Qiang, Yu Guoping, Zhou Guomin, He Yubing, Zhang Jianhua
Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China.
National Nanfan Research Institute (Sanya), Chinese Academy of Agricultural Sciences/Hainan Seed Industry Laboratory, Sanya 572024, China.
NAR Genom Bioinform. 2025 May 19;7(2):lqaf058. doi: 10.1093/nargab/lqaf058. eCollection 2025 Jun.
Regulation of DNA and RNA at the transcriptional, post-transcriptional, and translational levels is a key part of the central dogma of molecular biology. DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) play pivotal roles in the precise regulation of gene expression at these levels. Both classes are nucleic acid-binding proteins (NABPs), so they exhibit significant similarity in sequence and structure. However, traditional methods for identifying NABPs are typically time-consuming, costly, and difficult to scale up. Deep learning-based protein classification has emerged as a more efficient alternative. In this study, we propose DRBP-EDP, a phased classification method that integrates ESM-2 with a dual-path neural network. We also design a refined approach to dataset construction, which yields high-quality protein classification datasets. The model achieved 90.03% accuracy in the first stage, which separates NABPs from non-nucleic acid-binding proteins, and 89.56% accuracy in the second stage, which separates DBPs from RBPs. To enhance accessibility and usability, DRBP-EDP has been developed in both executable and web-based versions, which are publicly available at https://doi.org/10.5281/zenodo.14092184 and https://github.com/MuQiang-MQ/DRBP-EDP.
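The phased decision logic described in the abstract can be illustrated with a minimal Python sketch. This is not the DRBP-EDP implementation: the function name `classify_phased`, the `Scorer` type, the two scorer arguments, and the 0.5 threshold are all hypothetical. In the actual method each stage would presumably be a trained network operating on ESM-2 features, whereas here each stage is any callable that returns a probability.

```python
from typing import Callable

# Hypothetical type: maps a protein sequence to a probability in [0, 1].
Scorer = Callable[[str], float]


def classify_phased(
    sequence: str,
    nabp_scorer: Scorer,   # assumed stage-1 model: P(sequence is an NABP)
    dbp_scorer: Scorer,    # assumed stage-2 model: P(NABP is a DBP rather than an RBP)
    threshold: float = 0.5,  # assumed decision threshold, not taken from the paper
) -> str:
    """Two-stage classification: NABP vs non-NABP, then DBP vs RBP."""
    # Stage 1: filter out proteins that do not bind nucleic acids.
    if nabp_scorer(sequence) < threshold:
        return "non-NABP"
    # Stage 2: only reached for predicted NABPs.
    return "DBP" if dbp_scorer(sequence) >= threshold else "RBP"


# Toy stand-ins for the trained models, used only to exercise the control flow.
label = classify_phased("MKRK", lambda s: 0.9, lambda s: 0.2)
print(label)  # RBP
```

One consequence of this layout is that the second-stage model never sees proteins rejected in the first stage, so the two accuracies reported above describe separate decisions rather than a single end-to-end figure.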