MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization.

Author information

Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, New York 10016, United States.

Ph.D. Program in Biochemistry, The Graduate Center, The City University of New York, New York, New York 10016, United States.

Publication information

J Chem Inf Model. 2021 Apr 26;61(4):1570-1582. doi: 10.1021/acs.jcim.0c01285. Epub 2021 Mar 23.

DOI:10.1021/acs.jcim.0c01285

PMID:33757283

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8154251/

Abstract

Small molecules play a critical role in modulating biological systems. Knowledge of chemical-protein interactions helps address fundamental and practical questions in biology and medicine. However, with the rapid emergence of newly sequenced genes, the endogenous or surrogate ligands of a vast number of proteins remain unknown. Homology modeling and machine learning are two major methods for assigning new ligands to a protein, but both mostly fail when sequence homology between an unannotated protein and proteins with known functions or structures is low. In this study, we develop a new deep learning framework to predict chemical binding to evolutionarily divergent unannotated proteins, whose ligands cannot be reliably predicted by existing methods. By incorporating evolutionary information into self-supervised learning on unlabeled protein sequences, we develop a novel method, distilled sequence alignment embedding (DISAE), for protein sequence representation. DISAE can utilize all protein sequences and their multiple sequence alignments (MSAs) to capture functional relationships between proteins without knowledge of their structure and function. Following DISAE pretraining, we devise a module-based fine-tuning strategy for the supervised learning of chemical-protein interactions. In benchmark studies, DISAE significantly improves the generalizability of machine learning models and outperforms state-of-the-art methods by a large margin. Comprehensive ablation studies suggest that the use of MSA, sequence distillation, and triplet pretraining contributes critically to the success of DISAE. Interpretability analysis of DISAE suggests that it learns biologically meaningful information. We further use DISAE to assign ligands to human orphan G-protein coupled receptors (GPCRs) and to cluster the human GPCRome by integrating their phylogenetic and ligand relationships. The promising results of DISAE open an avenue for exploring the chemical landscape of entire sequenced genomes.
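To make the terms "sequence distillation" and "triplet" more concrete, the following is a minimal Python sketch of one way an MSA could be used to shorten a sequence and split it into three-residue tokens. It is an illustration only, not the DISAE implementation: the abstract does not specify the procedure, so the conservation score, the `threshold` value, the non-overlapping triplet scheme, the function names, and the toy alignment are all assumptions.

```python
from collections import Counter

def column_conservation(msa):
    """Per-column share of rows holding the most common non-gap residue.

    Assumption: this simple majority score stands in for whatever
    conservation criterion the actual method uses.
    """
    n_rows = len(msa)
    scores = []
    for column in zip(*msa):  # iterate over alignment columns
        residues = [r for r in column if r != "-"]
        if not residues:  # all-gap column
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / n_rows)
    return scores

def distill(aligned_seq, scores, threshold=0.7):
    """Keep only residues in columns scoring at or above the (hypothetical) threshold."""
    return "".join(
        r for r, s in zip(aligned_seq, scores) if s >= threshold and r != "-"
    )

def to_triplets(seq):
    """Split a sequence into non-overlapping triplets, dropping any remainder."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

# Toy alignment (made up for illustration): four aligned rows of equal length.
msa = [
    "MKTAYLAGW",
    "MKSAYLGGW",
    "MRTA-LAGW",
    "MKQAYIPGF",
]
scores = column_conservation(msa)
distilled = distill(msa[0], scores)
tokens = to_triplets(distilled)
print(distilled, tokens)
```

In this toy case the two most variable columns are dropped, and the remaining residues are grouped into triplet tokens that a sequence transformer could consume as its vocabulary. A real pipeline would read alignments from a curated source and would likely use a more careful conservation measure.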

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/69cb/8154251/c2075a5d3c24/ci0c01285_0001.jpg
