Hong Yiyu, Ha Junsu, Sim Jaemin, Lim Chae Jo, Oh Kwang-Seok, Chandrasekaran Ramakrishnan, Kim Bomin, Choi Jieun, Ko Junsu, Shin Woong-Hee, Lee Juyong
Arontier Co., 241, Gangnam-daero, Seocho-gu, Seoul, 06735, Republic of Korea.
Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, 08826, Republic of Korea.
J Cheminform. 2024 Nov 4;16(1):121. doi: 10.1186/s13321-024-00912-2.
We introduce a model for predicting protein-ligand interactions that combines the strengths of graph neural networks with physics-based scoring methods. Existing structure-based machine-learning models for protein-ligand binding prediction often fall short in practical virtual screening, hindered by the intricacies of binding poses, the chemical diversity of drug-like molecules, and the scarcity of crystallographic data for protein-ligand complexes. To overcome these limitations, we propose an approach that fuses three independent neural network models. One is a classification model that makes a binary prediction for a given protein-ligand complex pose. The other two are regression models trained to predict the binding affinity and the root-mean-square deviation of a ligand conformation from an input complex structure. We trained the model to account for both the deviations between experimental and predicted binding affinities and the uncertainties in pose prediction. By integrating the outputs of the three neural networks with a physics-based scoring function, our model showed significantly improved performance in hit identification. Benchmark results with three independent decoy sets demonstrate that our model outperformed existing models in forward screening. Our model achieved top 1% enrichment factors of 32.7 and 23.1 on the CASF2016 and DUD-E benchmark sets, respectively. Benchmark results on the LIT-PCBA set further confirmed its higher average enrichment factors, emphasizing the model's efficiency and generalizability.
The model's efficiency was further validated by identifying 23 active compounds from 63 candidates in experimental screening for autotaxin inhibitors, demonstrating its practical applicability in hit discovery.
Scientific contribution
Our work introduces a novel training strategy for a protein-ligand binding affinity prediction model by integrating the outputs of three independent sub-models and utilizing expertly crafted decoy sets. The model showcases exceptional performance across multiple benchmarks. The high enrichment factors in the LIT-PCBA benchmark demonstrate its potential to accelerate hit discovery.
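The abstract reports top 1% enrichment factors without defining the metric. The sketch below uses the commonly cited definition, in which the enrichment factor is the fraction of actives among the top-ranked compounds divided by the fraction of actives in the whole library. The function name `enrichment_factor`, the convention that higher scores rank first, and the rounding of the top-1% cutoff are assumptions made for illustration; they are not taken from the paper, and the benchmarks may apply slightly different conventions.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at the given top fraction of a ranked library.

    scores: predicted scores, higher means more likely active (assumed convention)
    labels: 1 for active, 0 for decoy/inactive, aligned with scores
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equal in length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active compound is required")

    # Size of the top-ranked subset; rounding here is an assumed convention.
    n_top = max(1, int(round(n_total * fraction)))

    # Rank compounds from best to worst predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(label for _, label in ranked[:n_top])

    # Hit rate in the selected subset relative to the hit rate of the library.
    return (hits_in_top / n_top) / (n_actives / n_total)


# Toy library: 1000 compounds, 10 actives, 5 of them ranked in the top 10.
toy_scores = list(range(1000, 0, -1))
toy_labels = [0] * 1000
for i in list(range(5)) + list(range(500, 505)):
    toy_labels[i] = 1
toy_ef = enrichment_factor(toy_scores, toy_labels, fraction=0.01)
```

Under this definition a random ranking gives an enrichment factor of about 1 on average, so the reported values of 32.7 and 23.1 correspond to roughly 33-fold and 23-fold more actives in the top 1% than random selection would yield. By the same arithmetic, 23 actives out of 63 tested candidates is a hit rate of about 36.5%.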