StackDPP: a stacking ensemble based DNA-binding protein prediction model.

Affiliation

Department of CSE, BUET, ECE Building, West Palashi, Dhaka, 1000, Bangladesh.

Publication information

BMC Bioinformatics. 2024 Mar 14;25(1):111. doi: 10.1186/s12859-024-05714-9.

Abstract

BACKGROUND

DNA-binding proteins (DNA-BPs) are proteins that bind to and interact with DNA. DNA-BPs regulate and affect numerous biological processes, such as transcription, DNA replication, DNA repair, and the organization of chromosomal DNA. However, very few proteins are DNA-binding in nature. It is therefore necessary to develop an efficient predictor for identifying DNA-BPs.

RESULT

In this work, we propose new benchmark datasets for the DNA-binding protein prediction problem. We found several quality concerns in the widely used benchmark datasets PDB1075 (for training) and PDB186 (for independent testing), which necessitated the preparation of new benchmark datasets. Our proposed datasets, UNIPROT1424 and UNIPROT356, can be used for model training and independent testing, respectively. We retrained selected state-of-the-art DNA-BP predictors on the new datasets and report their performance. We also trained a novel predictor on the new benchmark dataset. We extracted features from various feature categories, then used a Random Forest classifier with Recursive Feature Elimination with Cross-Validation (RFECV) to select an optimal set of 452 features. We then propose a stacking ensemble architecture as our final prediction model. Named Stacking Ensemble Model for DNA-binding Protein Prediction, or StackDPP for short, the model achieved accuracies of 0.92, 0.92, and 0.93 in 10-fold cross-validation, jackknife testing, and independent testing, respectively.
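The selection-then-stacking workflow described above can be sketched with scikit-learn. This is only an illustrative outline, not the authors' implementation: the synthetic data stands in for real protein feature vectors, and the base learners (Random Forest, k-nearest neighbours), the logistic-regression meta-learner, and all hyperparameters are assumptions, since the abstract does not specify them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for protein feature vectors (NOT UNIPROT1424/UNIPROT356).
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=8, random_state=0
)

# Hold out a stratified split to play the role of an independent test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 1: RFECV ranks features with a Random Forest and picks the subset
# size that maximises cross-validated accuracy.
selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=5,                       # drop 5 features per elimination round
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    min_features_to_select=5,
)

# Step 2: stacking ensemble. Base learners and meta-learner are assumed
# here purely for illustration.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # out-of-fold base predictions feed the meta-learner
)

# Chaining both steps keeps feature selection inside the training data,
# so the held-out split never influences which features are kept.
model = Pipeline([("select", selector), ("stack", stack)])
model.fit(X_train, y_train)

n_selected = model.named_steps["select"].n_features_
test_acc = model.score(X_test, y_test)
print(f"selected features: {n_selected}, held-out accuracy: {test_acc:.2f}")
```

Wrapping the selector and the ensemble in one pipeline is a design choice made for this sketch: it avoids leaking test-set information into feature selection. Whether the original work arranged the steps this way is not stated in the abstract.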

CONCLUSION

StackDPP performed very well in cross-validation testing and outperformed all the state-of-the-art prediction models in independent testing. Its cross-validation performance generalized well to the independent test set. The source code of the model is publicly available at https://github.com/HasibAhmed1624/StackDPP. We therefore expect that this generalized model can be adopted by researchers and practitioners to identify novel DNA-binding proteins.
