Yu Jiahui, Wang Jike, Zhao Hong, Gao Junbo, Kang Yu, Cao Dongsheng, Wang Zhe, Hou Tingjun
Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China.
School of Computer Science, Wuhan University, Wuhan 430072, Hubei, P. R. China.
J Chem Inf Model. 2022 Jun 27;62(12):2973-2986. doi: 10.1021/acs.jcim.2c00038. Epub 2022 Jun 8.
Accurate estimation of the synthetic accessibility of small molecules is needed in many phases of drug discovery. Several expert-crafted scoring methods and descriptor-based quantitative structure-activity relationship (QSAR) models have been developed for synthetic accessibility assessment, but their practical use in drug discovery remains limited because of relatively low prediction accuracy and poor model interpretability. In this study, we propose a data-driven, interpretable prediction framework called GASA (Graph Attention-based assessment of Synthetic Accessibility), which evaluates the synthetic accessibility of small molecules by classifying compounds as easy-to-synthesize (ES) or hard-to-synthesize (HS). GASA is a graph neural network (GNN) architecture that derives its own features, applying an attention mechanism to automatically capture the structural features most relevant to synthetic accessibility. Sampling around the hypothetical classification boundary was used to improve the ability of GASA to distinguish structurally similar molecules. GASA was extensively evaluated and compared with two descriptor-based machine learning methods (random forest, RF; eXtreme gradient boosting, XGBoost) and four existing scores (SYBA: SYnthetic Bayesian Accessibility; SCScore: Synthetic Complexity score; RAscore: Retrosynthetic Accessibility score; SAscore: Synthetic Accessibility score). Our analysis demonstrates that GASA performed remarkably well in distinguishing structurally similar molecules compared with the other methods and had a broader applicability domain. In addition, we show how GASA learns the important features that affect molecular synthetic accessibility by assigning attention weights to different atoms. An online prediction service for GASA is available at http://cadd.zju.edu.cn/gasa/.
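To make the idea of "assigning attention weights to different atoms" concrete, the following is a minimal sketch of a single generic graph-attention step on a toy molecular graph. It is not the GASA implementation: the abstract does not specify the layer design, so the scoring form (a LeakyReLU over a learned vector applied to concatenated projected features, normalized by a softmax over each atom's neighbors), the one-hot atom features, the toy ethanol graph, and the parameter values W and A are all assumptions chosen for illustration. The names attention_weights and received_attention are hypothetical helpers.

```python
import math

# Toy graph for the heavy atoms of ethanol, C-C-O (illustrative assumption).
# Features are one-hot element indicators: [is_carbon, is_oxygen].
FEATURES = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1]}

# Hypothetical "learned" parameters; in a trained model these come from training.
W = [[0.5, -0.3], [0.2, 0.8]]   # linear projection of atom features
A = [0.4, -0.1, 0.3, 0.7]       # attention vector over [z_i || z_j]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def attention_weights(features, neighbors, weight, attn):
    """Return alpha[i][j]: how strongly atom i attends to atom j.

    Each atom attends to its bonded neighbors and itself (self-loop),
    and each row is softmax-normalized so that it sums to 1.
    """
    projected = [matvec(weight, h) for h in features]
    alpha = {}
    for i, nbrs in neighbors.items():
        targets = sorted(set(nbrs) | {i})
        scores = [
            leaky_relu(sum(p * q for p, q in zip(attn, projected[i] + projected[j])))
            for j in targets
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha[i] = {j: e / total for j, e in zip(targets, exps)}
    return alpha


def received_attention(alpha):
    """Sum the attention each atom receives, a crude per-atom importance."""
    received = {i: 0.0 for i in alpha}
    for row in alpha.values():
        for j, w in row.items():
            received[j] += w
    return received


if __name__ == "__main__":
    alpha = attention_weights(FEATURES, NEIGHBORS, W, A)
    for atom, score in received_attention(alpha).items():
        print(f"atom {atom}: received attention {score:.3f}")
```

In a trained model, atoms that receive consistently high attention can be highlighted on the structure, which is one common way such weights are used for interpretation. Whether GASA aggregates its weights this way is not stated in the abstract.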