Department of Computer Science, Hunter College, The City University of New York, New York City, New York, United States of America.
Ph.D. Program in Computer Science, The Graduate Center, The City University of New York, New York City, New York, United States of America.
PLoS Comput Biol. 2023 Aug 17;19(8):e1010974. doi: 10.1371/journal.pcbi.1010974. eCollection 2023 Aug.
Proteolysis-targeting chimeras (PROTACs) are heterobifunctional molecules that induce the degradation of target proteins by recruiting an E3 ligase. PROTACs have the potential to inactivate disease-related gene products that are considered undruggable by small molecules, making them a promising therapy for currently incurable diseases. However, only a few hundred proteins have been experimentally tested for their amenability to PROTACs, and it remains unclear which other proteins encoded in the human genome can be targeted by PROTACs. In this study, we developed PrePROTAC, an interpretable machine learning model that combines a transformer-based protein sequence descriptor with random forest classification. PrePROTAC predicts, genome-wide, the targets that can be degraded through CRBN, one of the E3 ligases. In benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, an average precision of 0.84, and over 40% sensitivity at a false positive rate of 0.05. When evaluated on an external test set comprising proteins from structural folds different from those in the training set, the performance of PrePROTAC did not drop significantly, indicating that the model generalizes. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method, which extends conventional SHAP analysis of original features to an embedding space through in silico mutagenesis. This method allowed us to identify key residues in the protein structure that play critical roles in PROTAC activity. The identified key residues were consistent with existing knowledge. Using PrePROTAC, we identified over 600 novel, understudied proteins that are potentially degradable by CRBN and proposed PROTAC compounds for three novel drug targets associated with Alzheimer's disease.
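The benchmark figure "sensitivity at a false positive rate of 0.05" can be made concrete with a small worked example. The sketch below is not the PrePROTAC evaluation code. It is a minimal, standard-library illustration of how such a number is typically computed from classifier scores. The function name `sensitivity_at_fpr` and the toy labels and scores are hypothetical and are not data from the study.

```python
def sensitivity_at_fpr(labels, scores, max_fpr=0.05):
    """Return the highest true positive rate reachable while the
    false positive rate stays at or below max_fpr.

    labels: iterable of 0/1 (1 = degradable, 0 = not degradable)
    scores: iterable of classifier scores, higher = more likely positive
    """
    labels = list(labels)
    scores = list(scores)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")

    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Items with identical scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / n_neg > max_fpr:
            break  # FPR only grows from here, so stop sweeping
        best_tpr = max(best_tpr, tp / n_pos)
        i = j
    return best_tpr


# Hypothetical toy data: 5 positives and 20 negatives, so an FPR of 0.05
# tolerates exactly one false positive.
pos_scores = [0.9, 0.8, 0.6, 0.3, 0.2]
neg_scores = [0.7, 0.5] + [0.1] * 18
labels = [1] * len(pos_scores) + [0] * len(neg_scores)
scores = pos_scores + neg_scores

print(sensitivity_at_fpr(labels, scores, max_fpr=0.05))
```

On this toy data, three of the five positives score above the second-highest negative, so the threshold can admit one false positive (FPR 1/20) while recovering 3/5 of the positives. A stricter budget of zero false positives would recover only the two positives that outrank every negative.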