Libouban Pierre-Yves, Parisel Camille, Song Maxime, Aci-Sèche Samia, Gómez-Tamayo Jose C, Tresadern Gary, Bonnet Pascal
Institute of Organic and Analytical Chemistry (ICOA), UMR7311, Université d'Orléans, CNRS, Pôle de chimie rue de Chartres, 45067 Orléans Cedex 2, France.
Institute for Development and Resources in Intensive Scientific Computing (IDRIS), CNRS, Rue John Von Neumann, 91403 Orsay Cedex, France.
Bioinformatics. 2025 Aug 2;41(8). doi: 10.1093/bioinformatics/btaf429.
The field of protein-ligand binding affinity prediction continues to face significant challenges. While deep learning (DL) models can leverage 3D structural information of protein-ligand complexes, they perform well only on heavily biased test sets containing information leaked from training sets. This lack of generalization arises from the limited availability of training data and the models' inability to effectively learn from protein-ligand interactions. Since these interactions are inherently time-dependent, molecular dynamics (MD) simulations offer a potential solution by incorporating conformational sampling and providing interaction-rich information.
We have developed MDbind, a dataset comprising 63 000 simulations of protein-ligand interactions, along with novel neural networks capable of learning from these simulations to predict binding affinity. By utilizing MD as data augmentation, our models achieved state-of-the-art performance on the PDBbind v.2016 core set and an external test set, the free energy perturbation (FEP) dataset. Additionally, when trained on the full MD simulations, the models demonstrated less biased predictions.
The code for the neural networks is available at https://github.com/ICOA-SBC/MD_DL_BA. The models, the results and the training/validation/test sets are available for download at https://zenodo.org/records/10390550. The MDbind trajectories are being transferred to the MDDB: https://mmb-dev.mddbr.eu/#/browse?option=mdbind.