Schaller David A, Christ Clara D, Chodera John D, Volkamer Andrea
In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Augustenburger Platz 1, 13353 Berlin, Germany.
Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, New York 10065, United States.
J Chem Inf Model. 2024 Dec 9;64(23):8848-8858. doi: 10.1021/acs.jcim.4c00905. Epub 2024 Nov 18.
In recent years, machine learning has transformed many aspects of the drug discovery process, including small molecule design, for which the prediction of bioactivity is an integral part. Leveraging structural information about the interactions between a small molecule and its protein target has great potential for downstream machine learning scoring approaches but is fundamentally limited by the accuracy with which protein-ligand complex structures can be predicted in a reliable and automated fashion. With the goal of finding practical approaches to generating useful kinase-inhibitor complex geometries for downstream machine learning scoring approaches, we present a kinase-centric docking benchmark that assesses how well different classes of docking and pose selection strategies recapitulate experimentally observed binding modes in a realistic cross-docking scenario. The assembled benchmark data set focuses on the well-studied protein kinase family and comprises a subset of 589 protein structures cocrystallized with 423 ATP-competitive ligands. We find that docking methods biased by the cocrystallized ligand, utilizing shape overlap with or without maximum common substructure matching, are more successful in recovering binding poses than standard physics-based docking alone. In addition, docking into multiple structures significantly increases the chance of generating a low root-mean-square deviation (RMSD) docking pose. Docking with an approach that combines all three methods (Posit) into the structures with the most similar cocrystallized ligands, as judged by the maximum common substructure (MCS), proved to be the most efficient way to reproduce binding poses, achieving a success rate of 70.4% across all included systems.
The studied docking and pose selection strategies, which utilize the OpenEye Toolkits, were implemented into pipelines of the KinoML framework, allowing automated and reliable protein-ligand complex generation for future downstream machine learning tasks. Although focused on protein kinases, we believe that the general findings can also be transferred to other protein families.
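As an illustration of how such a benchmark can be scored, the following is a minimal sketch in plain Python, not the KinoML or OpenEye implementation. It computes an in-place RMSD between a docked pose and the experimentally observed pose and counts a ligand as a success if any of its poses falls within a cutoff, which mirrors the observation that docking into multiple structures raises the chance of a low-RMSD pose. The function names, the input layout, and the 2.0 Å cutoff are assumptions made for this example; the abstract does not state the cutoff used, and the sketch ignores molecular symmetry.

```python
import math


def rmsd(pose, reference):
    """In-place RMSD between two coordinate lists with identical atom order.

    No superposition and no symmetry correction is applied (assumption:
    both poses already share the same protein reference frame).
    """
    if len(pose) != len(reference) or not pose:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(pose, reference)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(pose))


def success_rate(poses_per_ligand, references, threshold=2.0):
    """Fraction of ligands with at least one pose within `threshold` of the reference.

    `poses_per_ligand` maps a ligand id to the poses obtained from docking into
    one or more structures; `references` maps the same ids to the experimental pose.
    The 2.0 Angstrom default is an assumed, commonly used cutoff.
    """
    hits = 0
    for ligand_id, reference in references.items():
        poses = poses_per_ligand.get(ligand_id, [])
        if any(rmsd(pose, reference) <= threshold for pose in poses):
            hits += 1
    return hits / len(references)


# Hypothetical toy data: two-atom "ligands" with made-up coordinates.
reference_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near_pose = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]   # shifted by 1 along x
far_pose = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]    # shifted by 3 along x

references = {"ligand_a": reference_pose, "ligand_b": reference_pose}
docked = {"ligand_a": [far_pose, near_pose], "ligand_b": [far_pose]}
rate = success_rate(docked, references)
```

In this toy setup `ligand_a` succeeds only because a second docking attempt produced a closer pose, while `ligand_b`, with a single attempt, does not.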