ESPDHot: An Effective Machine Learning-Based Approach for Predicting Protein-DNA Interaction Hotspots.

Affiliation

College of Chemistry and Life Science, Beijing University of Technology, Beijing 100124, China.

Publication information

J Chem Inf Model. 2024 Apr 22;64(8):3548-3557. doi: 10.1021/acs.jcim.3c02011. Epub 2024 Apr 8.

Abstract

Protein-DNA interactions are pivotal to various cellular processes. Precise identification of the hotspot residues for protein-DNA interactions is important for revealing the mechanisms of protein-DNA recognition and for guiding protein engineering. This work introduces ESPDHot, a method for predicting protein-DNA interaction hotspots based on a stacked ensemble machine learning framework. Here, an interface residue whose mutation leads to a binding free energy change (ΔΔG) exceeding 2 kcal/mol is defined as a hotspot. To address the imbalanced data set, adaptive synthetic sampling (ADASYN), an oversampling technique, is adopted to generate new synthetic minority-class samples. As for molecular characteristics, besides traditional features, we introduce three new feature types: a residue interface preference proposed by us, residue fluctuation dynamics characteristics, and coevolutionary features. Combining the Boruta method with our previously developed Random Grouping strategy, we obtained an optimal set of features. Finally, a stacking classifier is constructed to output the prediction results. It integrates three classical predictors, Support Vector Machine (SVM), XGBoost, and Artificial Neural Network (ANN), as the first layer, and a Logistic Regression (LR) algorithm as the second layer. Notably, ESPDHot outperforms the current state-of-the-art predictors on the independent test data set, with F1, MCC, and AUC reaching 0.571, 0.516, and 0.870, respectively.
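The hotspot definition and two of the reported metrics can be illustrated with a minimal sketch. This is not the authors' code. The function names, the toy ΔΔG values, and the toy predictions are hypothetical, and the sketch assumes that ΔΔG is reported in kcal/mol with positive values meaning weakened binding. It labels residues using the 2 kcal/mol threshold from the abstract, then computes F1 and the Matthews correlation coefficient (MCC) from confusion-matrix counts using their standard definitions.

```python
import math

# Threshold taken from the abstract: ddG exceeding 2 kcal/mol marks a hotspot.
HOTSPOT_THRESHOLD = 2.0


def label_hotspots(ddg_values, threshold=HOTSPOT_THRESHOLD):
    """Return 1 for residues whose ddG strictly exceeds the threshold, else 0."""
    return [1 if ddg > threshold else 0 for ddg in ddg_values]


def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical ddG values (kcal/mol) for six interface residues.
toy_ddg = [0.3, 2.5, 1.9, 3.1, 2.0, 0.8]
y_true = label_hotspots(toy_ddg)
# Hypothetical classifier output for the same six residues.
y_pred = [0, 1, 1, 0, 0, 0]

print(y_true, f1_score(y_true, y_pred), mcc(y_true, y_pred))
```

Because hotspots are the minority class, F1 and MCC are more informative than plain accuracy: both penalize a classifier that simply predicts "non-hotspot" for every residue.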
