Li Jie, Guan Xingyi, Zhang Oufan, Sun Kunyang, Wang Yingze, Bagni Dorian, Head-Gordon Teresa
ArXiv. 2024 May 3:arXiv:2308.09639v2.
Many physics-based and machine-learned scoring functions (SFs) used to predict protein-ligand binding free energies have been trained on the PDBBind dataset. However, it remains controversial whether new SFs are actually improving, because the general, refined, and core datasets of PDBBind are cross-contaminated with proteins and ligands of high similarity, and hence SFs trained on them may not perform comparably well when predicting binding for new protein-ligand complexes. In this work we have carefully prepared a cleaned PDBBind dataset of non-covalent binders, split into training, validation, and test datasets to control for data leakage, defined as proteins and ligands with high sequence and structural similarity. The resulting leak-proof (LP)-PDBBind data is used to retrain four popular SFs, AutoDock Vina, Random Forest (RF)-Score, InteractionGraphNet (IGN), and DeepDTA, to better test their capabilities when applied to new protein-ligand complexes. In particular, we have formulated a new independent dataset, BDB2020+, by matching high-quality binding free energies from BindingDB with co-crystallized ligand-protein complexes from the PDB that have been deposited since 2020. Based on all the benchmark results, the models retrained on LP-PDBBind consistently perform better, with IGN especially recommended for scoring and ranking applications for new protein-ligand systems.
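The idea of splitting a dataset so that highly similar entries never straddle the training, validation, and test sets can be illustrated with a minimal sketch. This is not the procedure used to build LP-PDBBind, which the abstract does not specify. Everything here is an assumption made for illustration: the toy sequences, the k-mer Jaccard score standing in for sequence similarity, the 0.5 threshold, the split fractions, and the hypothetical helper names `similarity_clusters` and `split_by_cluster`. The sketch groups entries into connected components of a "similar" graph and assigns each component to one split as a whole; it covers a single similarity axis, whereas the abstract refers to both protein and ligand similarity.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of overlapping k-mers in a sequence (toy feature, an assumption)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; a crude stand-in for sequence identity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_clusters(seqs, threshold):
    """Group ids into connected components of the graph linking pairs scoring >= threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    feats = {name: kmer_set(s) for name, s in seqs.items()}
    for a, b in combinations(seqs, 2):
        if jaccard(feats[a], feats[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in seqs:
        clusters.setdefault(find(name), []).append(name)
    # Largest clusters first, so they are placed while the splits are still empty.
    return sorted(clusters.values(), key=len, reverse=True)


def split_by_cluster(clusters, fractions=(0.8, 0.1, 0.1)):
    """Assign whole clusters to (train, val, test), filling the most under-filled split."""
    total = sum(len(c) for c in clusters)
    targets = [f * total for f in fractions]
    splits = ([], [], [])
    for cluster in clusters:
        i = max(range(3), key=lambda j: targets[j] - len(splits[j]))
        splits[i].extend(cluster)
    return splits


# Toy sequences, invented for illustration only.
toy = {
    "A1": "MKTAYIAKQRQISFVKSHFSRQ",
    "A2": "MKTAYIAKQRQISFVKSHFSRH",  # near-duplicate of A1
    "B1": "GSHMLEDPVDAFKLLQERGAHV",
    "B2": "GSHMLEDPVDAFKLLQERGAHL",  # near-duplicate of B1
    "C1": "MAHHHHHHSSGLEVLFQGPWWW",
    "D1": "QQQPLNNTRVVWYYDDEEKKCC",
}

clusters = similarity_clusters(toy, threshold=0.5)
train, val, test = split_by_cluster(clusters, fractions=(0.5, 0.25, 0.25))
print("train:", train, "val:", val, "test:", test)
```

Because any pair scoring at or above the threshold ends up in the same component, no such pair can be divided between two splits. The cost is that split sizes only approximate the requested fractions, and a single large component can dominate one split.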