Wang Yishu, Guo Mengyao, Chen Xiaomin, Ai Dongmei
School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, 100083, China.
Sci Rep. 2025 Feb 5;15(1):4419. doi: 10.1038/s41598-025-86840-z.
Traditional virtual screening methods must explore vast chemical spaces and depend on existing chemical libraries. Deep learning techniques for the de novo generation of molecules, also known as inverse molecular design, have become increasingly widespread and have brought revolutionary changes to de novo molecular generation research. In particular, the emergence of the transformer, a natural language processing (NLP) architecture, has advanced the state of the art of existing AI technologies. In this study, we modified one top-performing molecular generation model based on the generative pre-trained transformer (GPT) architecture in three directions. We also propose an integrated end-to-end neural network learning framework based on a complete encoder-decoder transformer model, the Text-to-Text Transfer Transformer (T5): the framework learns an embedding vector representation space of conditional molecular properties, which encodes and guides the vector representation of SMILES sequences, and the final decoder block produces a softmax output trained with a maximum likelihood objective. Finally, we evaluated the performance of these NLP-based generation models together with another new model architecture based on a selective state space, and selected the best approach, combined with a transfer learning strategy, for de novo drug discovery targeting L858R/T790M/C797S-mutant EGFR in non-small cell lung cancer.
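The abstract mentions a decoder with a softmax output and a maximum likelihood objective but gives no implementation details. As a minimal sketch of what such an objective generally looks like for token-by-token SMILES generation, the Python below computes the mean negative log-likelihood of a target token sequence from per-step logits. The function names, the toy vocabulary, and the logit values are hypothetical illustrations, not taken from the paper, and the conditioning on molecular properties is assumed to have already shaped the logits.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_nll(step_logits, target_ids):
    """Mean negative log-likelihood of the target tokens.

    step_logits: one list of vocabulary scores per decoding step
    target_ids:  the index of the correct token at each step
    Minimising this value is equivalent to maximising the likelihood
    of the target sequence under the model.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one target token per decoding step")
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        probs = softmax(logits)
        nll -= math.log(probs[target])
    return nll / len(target_ids)

# Hypothetical toy vocabulary and decoder scores (assumptions for illustration).
VOCAB = ["C", "O", "N", "<eos>"]
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],   # step 1: model favours "C"
    [0.3, 1.8, 0.2, -0.5],   # step 2: model favours "O"
    [-1.0, -0.5, 0.0, 2.5],  # step 3: model favours "<eos>"
]
toy_targets = [0, 1, 3]      # the target sequence "C", "O", "<eos>"

loss = sequence_nll(toy_logits, toy_targets)
print(f"mean negative log-likelihood: {loss:.4f}")
```

In a real model the logits would come from the final decoder block, and a deep learning framework's cross-entropy loss would normally replace these hand-written functions.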