Morehead Alex, Giri Nabin, Liu Jian, Neupane Pawan, Cheng Jianlin
Electrical Engineering & Computer Science, NextGen Precision Health, University of Missouri, Columbia, Missouri, USA.
ArXiv. 2025 Feb 9:arXiv:2405.14108v5.
The effects of ligand binding on protein structures and their functions carry numerous implications for modern biomedical research and biotechnology development efforts such as drug discovery. Although several deep learning (DL) methods and benchmarks designed for protein-ligand docking have recently been introduced, no prior work has systematically studied the behavior of the latest docking and structure prediction methods within the context of (1) using predicted (apo) protein structures for docking (e.g., for applicability to new proteins); (2) binding multiple (cofactor) ligands concurrently to a given target protein (e.g., for enzyme design); and (3) having no prior knowledge of binding pockets (e.g., for generalization to unknown pockets). To enable a deeper understanding of docking methods' real-world utility, we introduce PoseBench, the first comprehensive benchmark for protein-ligand docking. PoseBench enables researchers to rigorously and systematically evaluate DL methods for apo-to-holo protein-ligand docking and protein-ligand structure prediction using primary-ligand and multi-ligand benchmark datasets, the latter of which we introduce for the first time to the DL community. Empirically, using PoseBench, we find that (1) DL co-folding methods generally outperform comparable conventional and DL docking baselines, yet popular methods such as AlphaFold 3 are still challenged by prediction targets with novel protein sequences; (2) certain DL co-folding methods are highly sensitive to their input multiple sequence alignments, while others are not; and (3) DL methods struggle to strike a balance between structural accuracy and chemical specificity when predicting novel or multi-ligand protein targets. Code, data, tutorials, and benchmark results are available at https://github.com/BioinfoMachineLearning/PoseBench.