Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York, NY 10016, USA.
Department of Computer Science, Hunter College, The City University of New York, New York, NY 10065, USA.
Bioinformatics. 2022 Apr 28;38(9):2561-2570. doi: 10.1093/bioinformatics/btac154.
Over the past two decades, drug discovery has seen intensive exploration of predictive models of drug-target physical interactions. However, a critical knowledge gap must be filled to correlate drug-target interactions with clinical outcomes: predicting genome-wide receptor activities or functional selectivity, especially agonist versus antagonist, induced by novel chemicals. Two major obstacles compound the difficulty of this task: known receptor-activity data are far too scarce to train a robust model for genome-scale applications, and real-world applications must deploy a model on data drawn from various shifted distributions.
To address these challenges, we have developed an end-to-end deep learning framework, DeepREAL, for multi-scale modeling of genome-wide ligand-induced receptor activities. DeepREAL uses self-supervised learning on tens of millions of protein sequences and pre-training on binary interaction classification to address the data distribution shift and data scarcity problems. Extensive benchmark studies on G-protein coupled receptors (GPCRs), designed to simulate real-world scenarios, demonstrate that DeepREAL achieves state-of-the-art performance in out-of-distribution settings. DeepREAL can be extended to gene families beyond GPCRs.
All data used were downloaded from Pfam (Mistry et al., 2020), GLASS (Chan et al., 2015), IUPHAR/BPS and Sakamuru et al. (2021). Readers are directed to the official websites of these resources for the original data. Code is available on GitHub at https://github.com/XieResearchGroup/DeepREAL.
Supplementary data are available at Bioinformatics online.