Luo Xizi, Chi Amadeus Song Yi, Lin Andre Huikai, Ong Tze Jet, Wong Limsoon, Rahman Chowdhury Rafeed
School of Computing, National University of Singapore, Singapore 119077, Singapore.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae634.
Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various other cellular processes. In this paper, we conduct an unbiased benchmark of 11 state-of-the-art computational tools, as well as traditional tools such as ScanProsite, BLAST, and HMMER, for identifying DBPs. We highlight a data leakage issue in conventional datasets that leads to inflated performance. We introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques, and training methods, and recommend solutions to these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at https://github.com/Rafeed-bot/DNA_BP_Benchmarking.
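The abstract does not say how the predictions of the two best tools and the BLAST-based prediction are combined. As a rough illustration only, the sketch below assumes a simple majority vote over three binary calls per protein. The tool names, protein identifiers, and function names are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, Iterable


def majority_vote(calls: Iterable[bool]) -> bool:
    """Return True when more than half of the binary calls are positive."""
    calls = list(calls)
    return sum(calls) * 2 > len(calls)


def consensus_predictions(per_tool: Dict[str, Dict[str, bool]]) -> Dict[str, bool]:
    """Combine per-tool DBP calls into one call per protein.

    per_tool maps a tool name to {protein_id: predicted_is_dbp}.
    Only proteins scored by every tool are kept (an assumption of this sketch).
    """
    if not per_tool:
        return {}
    shared_ids = set.intersection(*(set(calls) for calls in per_tool.values()))
    return {
        pid: majority_vote(calls[pid] for calls in per_tool.values())
        for pid in sorted(shared_ids)
    }


# Hypothetical calls from two learned predictors and a BLAST-based predictor.
example = {
    "tool_a": {"P1": True, "P2": False, "P3": True},
    "tool_b": {"P1": True, "P2": False, "P3": False},
    "blast": {"P1": False, "P2": True, "P3": True},
}
result = consensus_predictions(example)
print(result)
```

With three voters, a protein is labeled a DBP when at least two of them agree. Other combination rules, such as requiring unanimity or weighting BLAST hits by similarity, are equally plausible readings of the abstract.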