BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias.

Affiliation

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.

Publication information

BMC Bioinformatics. 2024 Oct 30;25(1):340. doi: 10.1186/s12859-024-05968-3.

Abstract

BACKGROUND

Deep learning-based drug-target affinity (DTA) prediction methods have shown impressive performance despite having many trainable parameters relative to the available data. Previous studies have pointed to dataset bias by suggesting that models trained solely on protein or ligand structures may perform similarly to those trained on protein-ligand complex structures. However, these studies did not propose solutions and analyzed only complex structure-based models. Even when ligands are excluded, protein-only models trained on complex structures still incorporate some ligand information at the binding sites. It therefore remains unclear whether dataset bias allows binding affinity to be predicted accurately from compound or protein features alone. In this study, we extended the analysis to comprehensive databases and investigated dataset bias with compound and protein feature-based multilayer perceptron models. We assessed the impact of this bias on current prediction models and propose the binding affinity similarity explorer (BASE) web service, which provides bias-reduced datasets.
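To make the question of compound-only prediction concrete, the following is a minimal sketch and not the study's method. It swaps the multilayer perceptron for a much simpler stand-in: each compound is assigned its mean training affinity and the protein is ignored entirely. The records, identifiers, and affinity values are hypothetical toy data, and the function names are illustrative.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy records: (compound_id, protein_id, affinity, e.g. pKd).
train = [
    ("c1", "pA", 7.9), ("c1", "pB", 8.1),
    ("c2", "pA", 5.0), ("c2", "pC", 5.2),
    ("c3", "pB", 6.4), ("c3", "pC", 6.6),
]
test = [("c1", "pC", 7.8), ("c2", "pB", 5.3), ("c3", "pA", 6.5)]


def fit_compound_only(records):
    """Map each compound to its mean training affinity, ignoring the protein."""
    by_compound = defaultdict(list)
    for compound, _protein, affinity in records:
        by_compound[compound].append(affinity)
    return {c: sum(v) / len(v) for c, v in by_compound.items()}


def rmse(model, records, fallback):
    """Root-mean-square error; unseen compounds fall back to a constant."""
    errors = [(model.get(c, fallback) - y) ** 2 for c, _p, y in records]
    return sqrt(sum(errors) / len(errors))


model = fit_compound_only(train)
global_mean = sum(y for _c, _p, y in train) / len(train)

# Compare the compound-only predictor against a constant global-mean baseline.
compound_rmse = rmse(model, test, global_mean)
baseline_rmse = rmse({}, test, global_mean)
print(f"compound-only RMSE: {compound_rmse:.3f}")
print(f"global-mean RMSE:   {baseline_rmse:.3f}")
```

If a predictor that never sees the protein scores well on held-out pairs, as it does on this toy data by construction, the benchmark is rewarding memorized per-compound affinities rather than learned interactions.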

RESULTS

By analyzing eight binding affinity databases with multilayer perceptron models, we confirmed a bias in which compound-protein binding affinity can be accurately predicted from compound features alone. This bias arises because most compounds show consistent binding affinities across their targets, owing to high sequence or functional similarity among those target proteins. Our Uniform Manifold Approximation and Projection (UMAP) analysis based on compound fingerprints further revealed that low- and high-variation compounds do not differ significantly in structure. This suggests that the primary factor driving the consistent binding affinities is protein similarity rather than compound structure. We addressed this bias by creating datasets with progressively reduced protein similarity between the training and test sets, and observed significant changes in model performance. We developed the BASE web service so that researchers can download and use these datasets. Feature importance analysis revealed that previous models relied heavily on protein features. Using the bias-reduced datasets, however, increased the importance of compound and interaction features, enabling a more balanced extraction of key features.
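The idea of progressively reducing protein similarity between training and test sets can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build the BASE datasets: the abstract does not say how similarity is measured, so `difflib.SequenceMatcher` serves here as a crude stand-in for alignment-based sequence identity. The sequences, protein names, and cutoffs are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical toy protein sequences; real datasets would use full sequences.
proteins = {
    "kinase_1": "MGSNKSKPKDASQRRRSLEPAENVHGAGGG",
    "kinase_2": "MGSNKSKPKDPSQRRRSLEPPDSTHHGGFP",
    "protease": "MKTAYIAKQRQISFVKSHFSRQLEERLGLI",
    "receptor": "MDVLSPGQGNNTTSPPAPFETGGNTTGISD",
}


def similarity(seq_a, seq_b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def filter_test_proteins(train_ids, candidate_ids, max_similarity):
    """Keep candidates whose closest training protein is at or below the cutoff."""
    kept = []
    for cand in candidate_ids:
        closest = max(similarity(proteins[cand], proteins[t]) for t in train_ids)
        if closest <= max_similarity:
            kept.append(cand)
    return kept


train_ids = ["kinase_1"]
candidates = ["kinase_2", "protease", "receptor"]

# Tighter cutoffs yield test sets that are less similar to the training set.
for cutoff in (1.0, 0.6, 0.4):
    print(cutoff, filter_test_proteins(train_ids, candidates, cutoff))
```

Each tighter cutoff can only shrink the retained test set, so evaluating one model across the resulting series shows how much of its performance depends on having seen near-duplicate proteins during training.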

CONCLUSIONS

We propose the BASE web service, which provides both the affinity prediction results of existing models and bias-reduced datasets. These resources contribute to the development of generalized and robust predictive models, enhancing the accuracy and reliability of DTA predictions in the drug discovery process. BASE is freely available online at https://synbi2024.kaist.ac.kr/base.
