A workflow to create a high-quality protein-ligand binding dataset for training, validation, and prediction tasks.

Author information

Wang Yingze, Sun Kunyang, Li Jie, Guan Xingyi, Zhang Oufan, Bagni Dorian, Zhang Yang, Carlson Heather A, Head-Gordon Teresa

Affiliations

Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA.

Department of Computer Science, School of Computing, National University of Singapore 117417 Singapore.

Publication information

Digit Discov. 2025 Apr 2;4(5):1209-1220. doi: 10.1039/d4dd00357h. eCollection 2025 May 14.

DOI: 10.1039/d4dd00357h

PMID: 40190768

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11967698/

Abstract

Development of scoring functions (SFs) used to predict protein-ligand binding energies requires high-quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one of the widely used datasets, PDBbind, suffers from several common structural artifacts of both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. Therefore, we have developed a series of algorithms organized in a semi-automated workflow, HiQBind-WF, that curates non-covalent protein-ligand datasets to fix these problems. We also used this workflow to create an independent dataset, HiQBind, by matching binding free energies from various sources, including BioLiP, Binding MOAD, and BindingDB, with co-crystallized ligand-protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to ensure reproducibility and to minimize human intervention, while also being open-source to foster transparency in the improvements made to this important resource for the biology and drug discovery communities.
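The matching step mentioned in the abstract (pairing binding data with co-crystallized complexes) can be pictured with a small Python sketch. This is an illustration only, not the HiQBind-WF implementation: the record fields (`pdb_id`, `ligand_id`, `kd_molar`, `source`), the function names, and the example identifiers are all hypothetical, and the conversion of a dissociation constant to a free energy uses the standard relation ΔG = RT ln(Kd) at an assumed temperature of 298.15 K.

```python
import math

# Gas constant in kcal/(mol*K); temperature default is an assumption.
R_KCAL = 1.987204e-3


def kd_to_dg(kd_molar, temperature=298.15):
    """Convert a dissociation constant (mol/L) to a binding free energy (kcal/mol)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature * math.log(kd_molar)


def match_affinities(structures, affinities):
    """Join structure records with affinity records on (pdb_id, ligand_id).

    Both inputs are lists of dicts with hypothetical field names; keys are
    compared case-insensitively.
    """
    lookup = {}
    for rec in affinities:
        key = (rec["pdb_id"].upper(), rec["ligand_id"].upper())
        lookup.setdefault(key, []).append(rec)

    matched = []
    for s in structures:
        key = (s["pdb_id"].upper(), s["ligand_id"].upper())
        for rec in lookup.get(key, []):
            matched.append(
                {**s, "source": rec["source"], "dg_kcal_mol": kd_to_dg(rec["kd_molar"])}
            )
    return matched


# Made-up records for illustration; these are not real dataset entries.
example_structures = [{"pdb_id": "1abc", "ligand_id": "LIG"}]
example_affinities = [
    {"pdb_id": "1ABC", "ligand_id": "lig", "kd_molar": 1e-9, "source": "example"}
]
for row in match_affinities(example_structures, example_affinities):
    print(row["pdb_id"], row["ligand_id"], round(row["dg_kcal_mol"], 2))
```

Structures without a matching affinity record are simply dropped in this sketch; a real curation pipeline would likely also need to reconcile conflicting measurements and different assay types (Kd, Ki, IC50), which is not attempted here.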

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5ffe/11967698/54444a48e6d1/d4dd00357h-f1.jpg
