Thakkar Amol, Kogej Thierry, Reymond Jean-Louis, Engkvist Ola, Bjerrum Esben Jannik
Hit Discovery, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden.
Department of Chemistry and Biochemistry, University of Bern, Bern, Switzerland.
Chem Sci. 2019 Nov 5;11(1):154-168. doi: 10.1039/c9sc04944d. eCollection 2020 Jan 7.
Computer-Assisted Synthesis Planning (CASP) has recently attracted considerable interest. Here we investigate a template-based retrosynthetic planning tool trained on a variety of datasets comprising up to 17.5 million reactions. We demonstrate that models trained on datasets such as internal Electronic Laboratory Notebooks (ELN) and the publicly available United States Patent and Trademark Office (USPTO) extracts are sufficient to predict full synthetic routes to compounds of interest in medicinal chemistry. To this end, we assessed the models on 1731 compounds from 41 virtual libraries for which experimental results were known. Furthermore, we show that accuracy is a misleading metric for assessing the policy network, and propose that the number of successfully applied templates, together with the overall ability to generate full synthetic routes, be examined instead. We found that template specificity comes at the cost of generalizability and overall model performance. These findings are supplemented by a comparison of the underlying datasets and their corresponding models.
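The contrast the abstract draws between policy-network accuracy and the number of successfully applied templates can be illustrated with a minimal sketch. This is not the authors' evaluation code: the template identifiers, the toy rankings, and the `APPLICABLE` lookup table (standing in for an actual attempt to apply a template to a molecule) are all hypothetical and chosen only to show how the two metrics can diverge.

```python
from typing import Callable, Dict, List, Sequence, Set


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   recorded: Sequence[str], k: int) -> float:
    """Fraction of molecules whose recorded template is among the top-k predictions."""
    hits = sum(1 for preds, truth in zip(ranked, recorded) if truth in preds[:k])
    return hits / len(recorded)


def mean_applicable_templates(ranked: Sequence[Sequence[str]],
                              molecules: Sequence[str],
                              is_applicable: Callable[[str, str], bool],
                              k: int) -> float:
    """Average number of top-k predicted templates that can actually be applied."""
    counts = [sum(1 for t in preds[:k] if is_applicable(mol, t))
              for preds, mol in zip(ranked, molecules)]
    return sum(counts) / len(counts)


# Hypothetical toy data: two molecules, the template recorded for each in the
# dataset, and the policy's ranked template predictions.
MOLECULES: List[str] = ["mol_a", "mol_b"]
RECORDED: List[str] = ["t1", "t7"]
RANKED: List[List[str]] = [["t2", "t3", "t1"], ["t4", "t5", "t6"]]

# Hypothetical stand-in for applying a template to a molecule; a real check
# would run the reaction template against the structure.
APPLICABLE: Dict[str, Set[str]] = {
    "mol_a": {"t1", "t2", "t3"},
    "mol_b": {"t4", "t5"},
}


def is_applicable(mol: str, template: str) -> bool:
    return template in APPLICABLE.get(mol, set())


top1 = top_k_accuracy(RANKED, RECORDED, k=1)
top3 = top_k_accuracy(RANKED, RECORDED, k=3)
applied = mean_applicable_templates(RANKED, MOLECULES, is_applicable, k=3)
print(top1, top3, applied)
```

In this toy case the top-1 accuracy is zero even though most of the suggested templates are applicable and could still lead to a valid disconnection, which is the kind of mismatch that motivates reporting both quantities alongside route-finding success.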