Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.
Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, USA.
Bioinformatics. 2022 Sep 16;38(Suppl_2):ii68-ii74. doi: 10.1093/bioinformatics/btac470.
Computational methods for compound-protein affinity and contact (CPAC) prediction aim to facilitate rational drug discovery by simultaneously predicting the strength and the pattern of compound-protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often forces structure-free methods to rely on protein sequence inputs alone. The scarcity of compound-protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.
To overcome these challenges of structure naivety and labeled-data scarcity, we introduce cross-modality learning and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are embedded separately with recurrent and graph neural networks, respectively, and jointly with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies that leverage massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths in predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.
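As a rough illustration of embedding a protein in two modalities, the sketch below encodes a sequence with a plain recurrent layer, encodes a contact map with a single graph-convolution layer, and joins the two per-residue embeddings by concatenation. This is not the authors' implementation: the abstract does not specify the two cross-modality schemes, so concatenation is an assumption, and the function names, embedding size, random untrained weights, and toy contact map are all hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def one_hot(seq):
    """Encode an amino-acid string as an (L, 20) one-hot matrix."""
    idx = [AMINO_ACIDS.index(a) for a in seq]
    return np.eye(len(AMINO_ACIDS))[idx]


def rnn_embed(x, w_in, w_rec):
    """1D modality: simple tanh recurrent layer, one hidden state per residue."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec)
        states.append(h)
    return np.stack(states)  # (L, d)


def gcn_embed(x, contacts, w):
    """2D modality: one graph-convolution layer over the contact map."""
    a = contacts + np.eye(len(contacts))          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))     # degrees are >= 1
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ w, 0.0)        # ReLU, shape (L, d)


def cross_modality_concat(seq, contacts, d=8, seed=0):
    """Assumed joint scheme: concatenate per-residue embeddings."""
    rng = np.random.default_rng(seed)
    x = one_hot(seq)
    # Hypothetical untrained weights; a real model would learn these.
    w_in = rng.normal(scale=0.1, size=(20, d))
    w_rec = rng.normal(scale=0.1, size=(d, d))
    w_gcn = rng.normal(scale=0.1, size=(20, d))
    h_seq = rnn_embed(x, w_in, w_rec)
    h_graph = gcn_embed(x, contacts, w_gcn)
    return np.concatenate([h_seq, h_graph], axis=1)  # (L, 2d)


# Toy example: a 5-residue sequence and a symmetric binary contact map.
seq = "MKTAY"
contacts = np.array([
    [0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
], dtype=float)

emb = cross_modality_concat(seq, contacts)
print(emb.shape)  # (5, 16)
```

Each row of `emb` is one residue: the first `d` columns come from the sequence encoder and the last `d` from the graph encoder. In a trained model such residue-level embeddings could feed both an affinity predictor and a contact predictor, and the encoders could first be pre-trained on unlabeled proteins.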
Data and source code are available at https://github.com/Shen-Lab/CPAC.
Supplementary data are available at Bioinformatics online.