Shee Yu, Morgunov Anton, Li Haote, Batista Victor S
Department of Chemistry, Yale University, P.O. Box 208107, New Haven, Connecticut 06520-8107, United States.
J Chem Inf Model. 2025 Apr 28;65(8):3903-3914. doi: 10.1021/acs.jcim.4c01982. Epub 2025 Apr 8.
Traditional computer-aided synthesis planning (CASP) methods rely on iterative single-step predictions, leading to exponential search space growth that limits efficiency and scalability. We introduce a series of transformer-based models that leverage a mixture of experts approach to directly generate multistep synthetic routes as a single string, conditionally predicting each transformation based on all preceding ones. Our DMS Explorer XL model, which requires only target compounds as input, outperforms state-of-the-art methods on the PaRoutes dataset with 1.9x and 3.1x improvements in Top-1 accuracy on the n1 and n5 test sets, respectively. Providing additional information, such as the desired number of steps and starting materials, enables both a reduction in model size and an increase in accuracy, highlighting the benefits of incorporating more constraints into the prediction process. The top-performing DMS-Flex (Duo) model scores 25-50% higher on Top-1 and Top-10 accuracies for both the n1 and n5 sets. Additionally, our models successfully predict routes for FDA-approved drugs not included in the training data, demonstrating strong generalization capabilities. While the limited diversity of the training set may affect performance on less common reaction types, our multistep-first approach presents a promising direction toward fully automated retrosynthetic planning.
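The Top-1 and Top-10 accuracies quoted in the abstract are ranking metrics. As a minimal sketch of how such a metric can be computed, the Python snippet below counts a target as solved when its reference route appears among the first k ranked predictions. The function name `top_k_accuracy`, the toy route labels, and the use of exact string equality as the match criterion are all assumptions made for illustration; the abstract does not state how the authors compare a predicted route string with the reference.

```python
from typing import Dict, List


def top_k_accuracy(
    predictions: Dict[str, List[str]],
    references: Dict[str, str],
    k: int,
) -> float:
    """Fraction of targets whose reference route is among the first k predictions.

    Assumption: a route is one string and a match is exact string equality.
    """
    if not references:
        raise ValueError("references must not be empty")
    hits = 0
    for target, reference_route in references.items():
        # Targets with no predictions count as misses.
        ranked = predictions.get(target, [])
        if reference_route in ranked[:k]:
            hits += 1
    return hits / len(references)


# Hypothetical toy data: four targets, each with a ranked list of route strings.
references = {"T1": "route_a", "T2": "route_b", "T3": "route_c", "T4": "route_d"}
predictions = {
    "T1": ["route_a", "route_x"],             # correct at rank 1
    "T2": ["route_y", "route_b"],             # correct at rank 2
    "T3": ["route_z", "route_w", "route_c"],  # correct at rank 3
    "T4": ["route_q"],                        # never correct
}

top1 = top_k_accuracy(predictions, references, k=1)
top2 = top_k_accuracy(predictions, references, k=2)
top10 = top_k_accuracy(predictions, references, k=10)
print(top1, top2, top10)
```

On this toy data the accuracy rises from 0.25 at k=1 to 0.75 at k=10. A relative improvement such as "1.9x" would then be the ratio of two such scores measured on the same test set.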