Cross-modality and self-supervised protein embedding for compound-protein affinity and contact prediction.

Affiliations

Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.

Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, USA.

Publication information

Bioinformatics. 2022 Sep 16;38(Suppl_2):ii68-ii74. doi: 10.1093/bioinformatics/btac470.

Abstract

MOTIVATION

Computational methods for compound-protein affinity and contact (CPAC) prediction aim at facilitating rational drug discovery by simultaneous prediction of the strength and the pattern of compound-protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound-protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.

RESULTS

To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are embedded separately with recurrent and graph neural networks, respectively, and jointly with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths at predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.
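
As a rough illustration of the two-modality embedding described above, the following Python sketch embeds a toy protein with a plain tanh recurrent pass over its sequence and one round of mean aggregation over its contact map, then concatenates the two per-residue vectors. Everything in it is an assumption made for illustration: the function names, the random untrained weights, the hidden size, and concatenation as the fusion step are hypothetical and do not represent the authors' two cross-modality schemes, which the abstract does not specify.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def one_hot(residue):
    """Encode one residue as a 20-dimensional indicator vector."""
    return [1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rnn_embed(seq, W_in, W_h):
    """Sequence modality: plain tanh RNN, one hidden state per residue."""
    h = [0.0] * len(W_h)
    states = []
    for residue in seq:
        a = matvec(W_in, one_hot(residue))
        b = matvec(W_h, h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        states.append(h)
    return states


def gnn_embed(seq, contacts, W_g):
    """Graph modality: one round of mean aggregation over contact neighbours."""
    feats = [one_hot(r) for r in seq]
    out = []
    for i in range(len(seq)):
        # Include the residue itself so the neighbourhood is never empty.
        nbrs = [j for j in range(len(seq)) if contacts[i][j] or j == i]
        mean = [sum(feats[j][k] for j in nbrs) / len(nbrs)
                for k in range(len(AMINO_ACIDS))]
        out.append([math.tanh(v) for v in matvec(W_g, mean)])
    return out


def cross_modal_embed(seq, contacts, hidden=4, seed=0):
    """Assumed fusion: concatenate per-residue sequence and graph embeddings."""
    rng = random.Random(seed)
    W_in = random_matrix(hidden, len(AMINO_ACIDS), rng)
    W_h = random_matrix(hidden, hidden, rng)
    W_g = random_matrix(hidden, len(AMINO_ACIDS), rng)
    seq_emb = rnn_embed(seq, W_in, W_h)
    graph_emb = gnn_embed(seq, contacts, W_g)
    return [s + g for s, g in zip(seq_emb, graph_emb)]


# Toy input: a 4-residue chain whose contact map links adjacent residues only.
seq = "ACDK"
contacts = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
emb = cross_modal_embed(seq, contacts)
```

Each residue ends up with a vector of length `2 * hidden`, the first half carrying sequential context and the second half carrying contact-map context. A downstream affinity or contact predictor would consume these vectors together with a compound embedding, which is outside the scope of this sketch.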

AVAILABILITY AND IMPLEMENTATION

Data and source code are available at https://github.com/Shen-Lab/CPAC.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.
