Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), 45137-66731, Zanjan, Iran.
Sci Rep. 2022 Jan 10;12(1):410. doi: 10.1038/s41598-021-04448-5.
Despite considerable advances achieved by applying machine learning to protein-ligand affinity prediction, incorporating receptor flexibility remains an important bottleneck. Ensemble docking is widely used to address this problem, but the optimal choice of receptor conformations is still an open question because of computational cost and false-positive pose predictions. Here, a combination of ensemble learning and ensemble docking is proposed to rank conformations of the target protein by their importance for the final accuracy of the model. Available X-ray structures of cyclin-dependent kinase 2 (CDK2) in complex with different ligands are used as the initial receptor ensemble, and its redundancy is removed with a graph-based procedure, which is shown to be more efficient and less subjective than clustering-based representative selection methods. A set of ligands with experimentally measured affinities is docked to this nonredundant receptor ensemble, and the energetic features of the best-scored poses are used in an ensemble learning procedure based on the random forest method. The importance of each receptor is obtained from feature selection measures, and a few of the most important conformations are shown to be sufficient to reach 1 kcal/mol accuracy in affinity prediction, with considerable improvement in the early enrichment power of the models compared with ensemble docking strategies that do not use learning. The result is a clear strategy in which machine learning selects the most important experimental conformers of the receptor from a large set of protein-ligand complexes while keeping the accuracy of affinity predictions as high as the available data allow. Our results could inform future attempts to design receptor-specific docking-rescoring strategies.
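The receptor-ranking step described in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn's `RandomForestRegressor` and its impurity-based `feature_importances_` as the importance measure; the abstract does not specify which feature selection measure or which energetic features the authors used. The function name `rank_receptors`, the synthetic score matrix, and all numeric settings are hypothetical and are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rank_receptors(scores, affinities, n_trees=200, seed=0):
    """Rank receptor conformations by random forest feature importance.

    scores:     array (n_ligands, n_receptors); column j holds the best
                docking score of each ligand against receptor conformation j
    affinities: array (n_ligands,) of experimental binding affinities
    Returns (order, importances), where order lists receptor indices from
    most to least important.
    """
    model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
    model.fit(scores, affinities)
    importances = model.feature_importances_
    order = np.argsort(importances)[::-1]
    return order, importances


# Synthetic stand-in for real docking output (assumption: 200 ligands,
# 8 receptor conformations, scores in kcal/mol-like units).
rng = np.random.default_rng(0)
n_ligands, n_receptors = 200, 8
scores = rng.normal(-8.0, 1.5, size=(n_ligands, n_receptors))

# By construction, only conformations 2 and 5 carry signal about affinity.
affinities = (0.7 * scores[:, 2] + 0.3 * scores[:, 5]
              + rng.normal(0.0, 0.2, size=n_ligands))

order, importances = rank_receptors(scores, affinities)
print("receptor ranking (most important first):", order)
```

In this toy setup, one docking score per receptor is the only feature, so a feature's importance maps directly onto a receptor conformation. With several energetic terms per receptor, the per-receptor importance would have to be aggregated over that receptor's features, for example by summing them.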