Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design.

Authors

Francoeur Paul G, Masuda Tomohide, Sunseri Jocelyn, Jia Andrew, Iovanisci Richard B, Snyder Ian, Koes David R

Affiliation

Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States.

Publication

J Chem Inf Model. 2020 Sep 28;60(9):4200-4215. doi: 10.1021/acs.jcim.0c00411. Epub 2020 Sep 10.

DOI:10.1021/acs.jcim.0c00411

PMID:32865404

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8902699/

Abstract

One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and there does not exist a standard data set of sufficient size to compare performance between models. We present a new data set for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network (CNN) models on this data set. We also demonstrate how the partitioning of the training data and test data can impact the results of models trained with the PDBbind data set, how performance improves by adding more lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of five densely connected CNNs, achieves a root mean squared error of 1.42 and Pearson of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized data set for training machine learning models to recognize ligands in noncognate target structures while also greatly expanding the number of poses available for training. In order to facilitate community adoption of this data set for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.
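The abstract reports three kinds of figures: a root mean squared error and a Pearson correlation for affinity prediction, and an accuracy for pose selection. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names and the input layout are hypothetical, and the sketch assumes that pose selection accuracy means the fraction of complexes whose highest-scored pose is labeled as a good pose.

```python
import math


def rmse(pred, true):
    """Root mean squared error between predicted and experimental affinities."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def pearson_r(pred, true):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(true)
    mean_p = sum(pred) / n
    mean_t = sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)


def top1_pose_accuracy(poses):
    """Fraction of complexes whose best-scored pose is labeled good.

    `poses` is an iterable of (complex_id, score, is_good) tuples.
    This layout is an assumption made for illustration only.
    """
    best = {}  # complex_id -> (score, is_good) of the highest-scored pose so far
    for complex_id, score, is_good in poses:
        if complex_id not in best or score > best[complex_id][0]:
            best[complex_id] = (score, is_good)
    return sum(1 for _, good in best.values() if good) / len(best)
```

For example, with two complexes where only one has a good pose ranked first, `top1_pose_accuracy` returns 0.5. Grouping by complex before taking the top pose matters because a data set with many poses per complex would otherwise weight heavily sampled complexes more than sparsely sampled ones.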
