Wang Jike, Luo Hao, Qin Rui, Wang Mingyang, Wan Xiaozhe, Fang Meijing, Zhang Odin, Gou Qiaolin, Su Qun, Shen Chao, You Ziyi, Liu Liwei, Hsieh Chang-Yu, Hou Tingjun, Kang Yu
College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
Advanced Computing and Storage Laboratory, Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd., Nanjing 210000, Jiangsu, China
Chem Sci. 2024 Dec 4;16(2):637-648. doi: 10.1039/d4sc06864e. eCollection 2025 Jan 2.
The generation of three-dimensional (3D) molecules based on target structures represents a cutting-edge challenge in drug discovery. Many existing approaches produce molecules with invalid configurations, unphysical conformations, suboptimal drug-likeness, and limited synthesizability, and they require long generation times. To address these challenges, we present 3DSMILES-GPT, a fully language-model-driven framework for 3D molecular generation that operates exclusively on tokens. We treat both two-dimensional (2D) and 3D molecular representations as linguistic expressions, combine them into a full-dimensional representation, and pre-train the model on a large dataset of tens of millions of drug-like molecules. This token-only approach enables the model to learn the 2D and 3D characteristics of molecules across a large-scale dataset. Subsequently, we fine-tune the model on paired structural data of protein pockets and molecules, and then apply reinforcement learning to further optimize the biophysical and chemical properties of the generated molecules. Experimental results show that molecules generated by 3DSMILES-GPT outperform those from existing methods across binding affinity, drug-likeness (quantitative estimate of drug-likeness, QED), and synthetic accessibility score (SAS). Notably, it achieves a 33% improvement in QED while maintaining state-of-the-art binding affinity as estimated by Vina docking. Generation is also fast, averaging approximately 0.45 seconds per generation, a threefold speedup over the fastest existing methods. The 3DSMILES-GPT approach has the potential to positively impact the generation of 3D molecules in drug discovery.