Long Yuxi, Donald Bruce R
bioRxiv. 2024 Oct 21:2023.11.16.567384. doi: 10.1101/2023.11.16.567384.
Accurate binding affinity prediction is crucial to structure-based drug design. Recent work used computational topology to obtain an effective representation of protein-ligand interactions. While algorithms using algebraic topology have proven useful in predicting properties of biomolecules, previous algorithms employed uninterpretable machine learning models that failed to explain the underlying geometric and topological features that drive accurate binding affinity prediction. Moreover, their high computational complexity made them intractable for large proteins. We present the fastest known algorithm to compute persistent homology features for protein-ligand complexes using opposition distance, with a runtime that is independent of the protein size. Then, we exploit these features in a novel, interpretable algorithm to predict protein-ligand binding affinity. Our algorithm achieves interpretability through an effective embedding of distances across bipartite matchings of the protein and ligand atoms into real-valued functions by summing Gaussians centered at features constructed by persistent homology. We name these functions internuclear persistent contours (IPCs). Next, we introduce persistence fingerprints, a vector with 10 components, refined from IPCs, that sketches the distances of different bipartite matchings between protein and ligand atoms. Let the number of protein atoms in the protein-ligand complex be n, the number of ligand atoms be m, and ω ≈ 2.4 be the matrix multiplication exponent. We show that for any 0 < ε < 1, after an 𝒪(mn log(mn)) preprocessing procedure, we can compute an ε-accurate approximation to the persistence fingerprint in 𝒪(m log^{6ω}(m/ε)) time, independent of protein size. This is an improvement in time complexity by a factor of 𝒪((m + n)^3) over any previous binding affinity prediction algorithm that uses persistent homology. We show that the representational power of the persistence fingerprint generalizes to protein-ligand binding datasets beyond the training dataset.
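The embedding step described above, summing Gaussians centered at persistent homology features to obtain a real-valued function, can be illustrated with a minimal sketch. The abstract does not give the Gaussian width, any weighting or normalization, or the feature values themselves, so the function name `gaussian_sum_contour`, the `sigma` parameter, the unnormalized kernel, and the example centers below are all assumptions made for illustration, not the authors' implementation.

```python
import math


def gaussian_sum_contour(centers, sigma=0.5):
    """Build f(x) = sum over c of exp(-(x - c)^2 / (2 * sigma^2)).

    `centers` stands in for scalar features produced by persistent homology
    (for example, distances). The unnormalized Gaussian and the default
    `sigma` are illustrative assumptions.
    """
    def contour(x):
        return sum(
            math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers
        )
    return contour


# Hypothetical feature values, not data from the paper.
centers = [2.0, 2.1, 3.5]
f = gaussian_sum_contour(centers, sigma=0.5)

# Sample the function on a coarse grid from 0.0 to 6.0 in steps of 0.5.
samples = [(i / 2.0, f(i / 2.0)) for i in range(13)]
```

Under these assumptions the function is highest where many features cluster, so a peak can be read back as "several matched protein-ligand distances near this value", which is the sense in which such a representation is inspectable.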
Then, we introduce PATH, Predicting Affinity Through Homology, a two-part algorithm consisting of PATH⁺ and PATH⁻. PATH⁺ is an interpretable, small ensemble of shallow regression trees for binding affinity prediction from persistence fingerprints. We show that despite using 1,400-fold fewer features, PATH⁺ has comparable performance to a previous state-of-the-art binding affinity prediction algorithm that uses persistent homology. Moreover, PATH⁺ has the advantage of being interpretable. We visualize the features captured by the persistence fingerprint for variant HIV-1 protease complexes and show that the persistence fingerprint captures binding-relevant structural mutations. PATH⁻, in turn, uses regression trees over IPCs to differentiate between binding and decoy complexes. Finally, we benchmark PATH against established binding affinity prediction algorithms spanning physics-based, knowledge-based, and deep learning methods, revealing that PATH has comparable or better performance, with less overfitting, than these state-of-the-art methods. The source code for PATH is released as open source as part of the OSPREY protein design software package.
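To make concrete why a small ensemble of shallow regression trees over a 10-component vector is easy to inspect, the sketch below evaluates such an ensemble and records the decisions each tree takes. Everything specific here is hypothetical: the tree structures, thresholds, and leaf values are invented for illustration, and averaging the tree outputs is an assumption, since the abstract does not say how the trees are trained or combined.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float  # predicted affinity contribution (hypothetical units)


@dataclass
class Split:
    feature: int      # index into the 10-component fingerprint
    threshold: float
    left: "Node"      # followed when fingerprint[feature] <= threshold
    right: "Node"     # followed otherwise


Node = Union[Leaf, Split]


def predict_tree(node, x):
    """Walk one tree from the root to a leaf and return the leaf value."""
    while isinstance(node, Split):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def explain_tree(node, x):
    """Return the (feature, threshold, went_left) decisions taken for x."""
    path = []
    while isinstance(node, Split):
        went_left = x[node.feature] <= node.threshold
        path.append((node.feature, node.threshold, went_left))
        node = node.left if went_left else node.right
    return path


def predict_ensemble(trees, x):
    """Average the per-tree predictions (the averaging rule is an assumption)."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)


# Two hypothetical shallow trees; thresholds and leaf values are made up.
trees = [
    Split(0, 1.0, Leaf(5.0), Leaf(7.0)),
    Split(3, 0.5, Leaf(4.0), Split(7, 2.0, Leaf(6.0), Leaf(8.0))),
]
fingerprint = [1.5, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

prediction = predict_ensemble(trees, fingerprint)
paths = [explain_tree(t, fingerprint) for t in trees]
```

Because each tree is only a few comparisons deep, `paths` is a complete account of how the prediction was reached, listing which fingerprint components were compared against which thresholds.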