Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas 77843, United States.
TEES-AgriLife Center for Bioinformatics and Genomic Systems Engineering, Texas A&M University, College Station, Texas 77840, United States.
J Chem Inf Model. 2021 Jan 25;61(1):46-66. doi: 10.1021/acs.jcim.0c00866. Epub 2020 Dec 21.
Predicting compound-protein affinity is beneficial for accelerating drug discovery. Doing so without the often-unavailable structure data is gaining interest. However, recent progress in structure-free affinity prediction, made by machine learning, focuses on accuracy but leaves much to be desired for interpretability. We define intermolecular contacts underlying affinities as a vehicle for interpretability, and our large-scale interpretability assessment finds previously used attention mechanisms inadequate. We thus formulate a hierarchical multiobjective learning problem, where predicted contacts form the basis for predicted affinities. We solve the problem by embedding protein sequences (by hierarchical recurrent neural networks) and compound graphs (by graph neural networks) with joint attentions between protein residues and compound atoms. We further introduce three methodological advances to enhance interpretability: (1) structure-aware regularization of attentions using protein sequence-predicted solvent exposure and residue-residue contact maps; (2) supervision of attentions using known intermolecular contacts in training data; and (3) an intrinsically explainable architecture where atomic-level contacts or "relations" lead to molecular-level affinity prediction. Combining the first two advances yields DeepAffinity+, and combining all three yields DeepRelations. Our methods show generalizability in affinity prediction for molecules that are new and dissimilar to training examples. Moreover, they show superior interpretability compared to state-of-the-art interpretable methods: with similar or better affinity prediction, they boost the AUPRC of contact prediction by around 33-, 35-, 10-, and 9-fold for the default test, new-compound, new-protein, and both-new sets, respectively. We further demonstrate their potential utility in contact-assisted docking, structure-free binding site prediction, and structure-activity relationship studies without docking.
Our study represents the first model development and systematic model assessment dedicated to interpretable machine learning for structure-free compound-protein affinity prediction.
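To make the idea of joint attention as a contact predictor concrete, the following is a minimal sketch and not the authors' implementation. It assumes that residue and atom embeddings are already available as plain vectors, that attention scores are dot products normalized by a softmax over all residue-atom pairs, and that AUPRC is estimated by average precision. The function names `joint_attention` and `average_precision`, the toy embeddings, and the toy contact labels are all hypothetical.

```python
import math


def joint_attention(residue_emb, atom_emb):
    """Return a residue-by-atom attention matrix whose entries sum to 1.

    Assumption: scores are dot products of residue and atom embeddings,
    normalized with a softmax taken over ALL residue-atom pairs.
    """
    scores = [[sum(r * a for r, a in zip(res, atom)) for atom in atom_emb]
              for res in residue_emb]
    peak = max(max(row) for row in scores)  # subtract the max for stability
    exp_scores = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exp_scores)
    return [[e / total for e in row] for row in exp_scores]


def average_precision(labels, scores):
    """Average precision, a common step-wise estimate of AUPRC."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank  # precision at each recalled positive
    return ap / n_pos if n_pos else 0.0


# Toy data (hypothetical): 3 residues, 2 atoms, 2-dimensional embeddings.
residue_emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
atom_emb = [[2.0, 0.0], [0.0, 2.0]]
# Toy "known contacts": residue 0 with atom 0, residue 1 with atom 1.
true_contacts = [[1, 0], [0, 1], [0, 0]]

attention = joint_attention(residue_emb, atom_emb)

# Flatten both matrices so each residue-atom pair is one scored example.
flat_scores = [a for row in attention for a in row]
flat_labels = [c for row in true_contacts for c in row]
ap = average_precision(flat_labels, flat_scores)
print("contact-prediction average precision:", ap)
```

In this reading, the attention matrix doubles as a predicted contact map, so its quality can be scored against known contacts independently of the affinity output. Supervising or regularizing that matrix, as the abstract describes, would add loss terms on `attention` during training; those terms are not shown here.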