Wang Yingze, Sun Kunyang, Li Jie, Guan Xingyi, Zhang Oufan, Bagni Dorian, Zhang Yang, Carlson Heather A, Head-Gordon Teresa
Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA.
Department of Computer Science, School of Computing, National University of Singapore 117417 Singapore.
Digit Discov. 2025 Apr 2;4(5):1209-1220. doi: 10.1039/d4dd00357h. eCollection 2025 May 14.
Development of scoring functions (SFs) used to predict protein-ligand binding energies requires high-quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one of the most widely used datasets, PDBbind, suffers from several common structural artifacts in both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. We have therefore developed a series of algorithms, organized in a semi-automated workflow called HiQBind-WF, that curates non-covalent protein-ligand datasets to fix these problems. We also used this workflow to create an independent dataset, HiQBind, by matching binding free energies from various sources, including BioLiP, Binding MOAD, and BindingDB, with co-crystallized ligand-protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to ensure reproducibility and to minimize human intervention, and both are open-source to foster transparency in the improvements made to this important resource for the biology and drug discovery communities.
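The matching step described in the abstract can be illustrated with a minimal sketch. It is not the HiQBind-WF implementation: the record layout (`pdb_id`, `ligand_id`, `kd_molar`), the function names, and the choice of a 298.15 K reference temperature are all assumptions made for illustration. The sketch pairs affinity records with structure entries that share a PDB ID and ligand code, and converts a dissociation constant to a binding free energy using the standard relation ΔG = RT ln(Kd).

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.987204e-3


def kd_to_delta_g(kd_molar, temperature_k=298.15):
    """Convert a dissociation constant (mol/L) to a binding free energy (kcal/mol).

    Uses delta_G = R * T * ln(Kd); the default temperature is an assumption.
    """
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def match_affinities(structures, affinities):
    """Pair structure entries with affinity records that share (PDB ID, ligand ID).

    structures: iterable of (pdb_id, ligand_id) tuples (hypothetical layout).
    affinities: iterable of dicts with keys 'pdb_id', 'ligand_id', 'kd_molar',
                and 'source' (hypothetical layout).
    """
    # Index affinity records by a case-normalized key for constant-time lookup.
    by_key = {}
    for rec in affinities:
        key = (rec["pdb_id"].upper(), rec["ligand_id"].upper())
        by_key.setdefault(key, []).append(rec)

    matched = []
    for pdb_id, ligand_id in structures:
        key = (pdb_id.upper(), ligand_id.upper())
        for rec in by_key.get(key, []):
            matched.append(
                {
                    "pdb_id": key[0],
                    "ligand_id": key[1],
                    "source": rec["source"],
                    "delta_g_kcal_mol": kd_to_delta_g(rec["kd_molar"]),
                }
            )
    return matched
```

Structure entries with no matching affinity record are simply dropped here; a real curation pipeline would likely also need to reconcile conflicting values reported by different sources, which this sketch does not attempt.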