Schaller David, Christ Clara D, Chodera John D, Volkamer Andrea
In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité-Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Augustenburger Platz 1, 13353 Berlin, Germany.
Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA.
bioRxiv. 2023 Sep 14:2023.09.11.557138. doi: 10.1101/2023.09.11.557138.
In recent years, machine learning has transformed many aspects of the drug discovery process, including small molecule design, for which the prediction of bioactivity is an integral part. Leveraging structural information about the interactions between a small molecule and its protein target has great potential for downstream machine learning scoring approaches, but is fundamentally limited by the accuracy with which protein:ligand complex structures can be predicted in a reliable and automated fashion. With the goal of finding practical approaches to generating useful kinase:inhibitor complex geometries for downstream machine learning scoring approaches, we present a kinase-centric docking benchmark that assesses how well different classes of docking and pose selection strategies recapitulate experimentally observed binding modes in a realistic cross-docking scenario. The assembled benchmark data set focuses on the well-studied protein kinase family and comprises a subset of 589 protein structures co-crystallized with 423 ATP-competitive ligands. We find that docking methods biased by the co-crystallized ligand (utilizing shape overlap with or without maximum common substructure matching) are more successful in recovering binding poses than standard physics-based docking alone. In addition, docking into multiple structures significantly increases the chance of generating a low-RMSD docking pose. Docking with an approach that combines all three methods (Posit) into the structures with the most similar co-crystallized ligands, as judged by shape and electrostatics, proved to be the most efficient way to reproduce binding poses, achieving a success rate of 66.9% across all included systems. The studied docking and pose selection strategies, which utilize the OpenEye Toolkit, were implemented as pipelines in the KinoML framework, allowing automated and reliable protein:ligand complex generation for future downstream machine learning tasks. Although focused on protein kinases, we believe the general findings can also be transferred to other protein families.
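To make the reported metrics concrete, the following minimal sketch shows how a pose RMSD and a benchmark success rate could be computed, and why counting the best pose over several receptor structures can raise the chance of a low-RMSD result. It is an illustration only, not the KinoML or OpenEye implementation: the function names, the 2.0 Å success cutoff, and the toy RMSD values are all assumptions that do not come from the abstract, and the RMSD here ignores molecular symmetry and assumes matching atom order without superposition.

```python
import numpy as np


def rmsd(pose, reference):
    """In-place RMSD between two (n_atoms, 3) coordinate arrays.

    Assumes identical atom ordering and no superposition; symmetry
    equivalent atoms are NOT handled (a simplification).
    """
    pose = np.asarray(pose, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if pose.shape != reference.shape:
        raise ValueError("pose and reference must have the same shape")
    squared_dist = np.sum((pose - reference) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared_dist)))


def success_rate(rmsds, cutoff=2.0):
    """Fraction of systems whose pose RMSD is at or below the cutoff.

    The 2.0 Angstrom default is an assumed, commonly used threshold.
    """
    rmsds = list(rmsds)
    return sum(r <= cutoff for r in rmsds) / len(rmsds)


# Hypothetical RMSD values (Angstrom): one list per ligand, one entry
# per receptor structure the ligand was docked into.
toy_results = {
    "system_a": [3.1, 1.2],
    "system_b": [4.0, 2.6],
    "system_c": [1.8, 5.5],
    "system_d": [2.4, 0.9],
}

# Docking into a single structure (here: the first one listed).
single_structure = success_rate(v[0] for v in toy_results.values())

# Docking into multiple structures, counting the lowest-RMSD pose.
# This is an upper bound: a real pipeline still has to select that pose.
multi_structure = success_rate(min(v) for v in toy_results.values())

print(single_structure, multi_structure)  # 0.25 0.75
```

In this toy data set, the lowest-RMSD pose over two structures succeeds for three of four ligands, whereas the first structure alone succeeds for one. In practice the best pose is not known in advance, which is why the pose selection strategy matters as much as the docking method.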