Hu Yuting, Hu Feng, Zhang Hongwen, Xu Hongling, Gao Jixiang, Deng Wenshuai, Tian Zijing, Hu Qiaoyu, Li Honglin, Diao Yanyan
Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Xuhui District, Shanghai 200237, China.
Innovation Center for AI and Drug Discovery, School of Pharmacy, East China Normal University, 3663 North Zhongshan Road, Putuo District, Shanghai 200062, China.
Brief Bioinform. 2025 May 1;26(3). doi: 10.1093/bib/bbaf257.
Retrosynthetic route planning is essential for designing efficient pathways to synthesize complex molecules, serving as a cornerstone in drug discovery and organic synthesis. Sequence-based models have become a predominant approach to retrosynthetic route planning, yet their validity and robustness remain limited by challenges in molecular representation. Current methods typically treat reactants and products as independent molecules, overlooking structural relationships crucial for accurate synthesis predictions. Herein, we introduce RPSubAlign, a molecular sequence representation method specifically tailored for retrosynthetic tasks, which aligns common substructures between reactants and products to enhance the validity and robustness of sequence-based models. Compared with conventional random and root-alignment representations, RPSubAlign achieves better performance on the USPTO-50K and USPTO-MIT datasets, improving Top-N accuracy by up to 34.8% (with the Self-Referencing Embedded Strings, SELFIES, representation) and demonstrating enhanced stability across various data augmentation scenarios. RPSubAlign also significantly improves syntactic validity, reaching 86.64% on USPTO-50K and 96.45% on USPTO-MIT (with the Simplified Molecular Input Line Entry System, SMILES, representation), outperforming baseline methods. These results highlight RPSubAlign as a robust, effective molecular characterization method for retrosynthesis prediction. Code for RPSubAlign is available at https://github.com/Aminoacid1226/RPSubAlign.
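To make the two reported metrics concrete, the following is a minimal sketch of how Top-N accuracy and a validity rate could be computed over ranked predictions. It is not taken from the RPSubAlign repository: the function names `top_n_accuracy`, `validity_rate`, and `balanced_parentheses`, the toy strings, and the choice to measure validity over all returned candidates are assumptions made for illustration. The validity check here only tests balanced branch parentheses as a stand-in; a real evaluation would parse each string with a cheminformatics toolkit and compare canonicalized molecules.

```python
from typing import Callable, Sequence


def top_n_accuracy(predictions: Sequence[Sequence[str]],
                   targets: Sequence[str], n: int) -> float:
    """Fraction of examples whose target is among the first n ranked predictions."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for ranked, target in zip(predictions, targets)
               if target in ranked[:n])
    return hits / len(targets)


def validity_rate(predictions: Sequence[Sequence[str]],
                  is_valid: Callable[[str], bool]) -> float:
    """Fraction of all predicted strings accepted by the supplied checker."""
    flat = [s for ranked in predictions for s in ranked]
    if not flat:
        return 0.0
    return sum(1 for s in flat if is_valid(s)) / len(flat)


def balanced_parentheses(s: str) -> bool:
    """Toy stand-in for a real parser: checks branch parentheses only."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Hypothetical ranked predictions for two products (not results from the paper).
predictions = [
    ["CCO", "CC(=O)O", "CC(C"],
    ["c1ccccc1", "CCN", "CC)O"],
]
targets = ["CC(=O)O", "CCBr"]

top1 = top_n_accuracy(predictions, targets, 1)
top3 = top_n_accuracy(predictions, targets, 3)
valid = validity_rate(predictions, balanced_parentheses)
print(f"top-1={top1:.2f} top-3={top3:.2f} validity={valid:.2f}")
```

In this toy data the first target is recovered at rank 2 and the second is never recovered, so widening N from 1 to 3 raises the accuracy; two of the six candidate strings have unbalanced parentheses and count as invalid.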