Cui Yongrui, Shan Dongjing, Lu Qiheng, Zou Beijia, Zhang Huali, Li Jin, Mao Jiashun
School of Computer Science and Engineering, Dalian Minzu University, Dalian, Liaoning Province, 116600, China.
School of Medical Information and Engineering, Southwest Medical University, Luzhou, Sichuan Province, 646000, China.
J Comput Aided Mol Des. 2025 Jul 18;39(1):54. doi: 10.1007/s10822-025-00614-3.
In recent years, the emergence of large language models (LLMs), and the advent of ChatGPT in particular, has made natural-language sequence-based representation learning and generative modeling the dominant research paradigm in AI for science. Within drug discovery and computational chemistry, compound representation learning and molecular generation are two of the most important tasks. The molecular string representations most widely used for characterization and generation are SMILES (Simplified Molecular Input Line Entry System), SELFIES (SELF-referencing Embedded Strings), SMARTS (SMILES Arbitrary Target Specification), and IUPAC (International Union of Pure and Applied Chemistry) nomenclature. In AI-assisted drug design, each of these molecular languages has its own strengths and weaknesses, and the granularity of information they encode varies significantly. Selecting an appropriate molecular representation as the input format for model training is therefore crucial, yet this question has not been thoroughly explored. Moreover, diffusion models are currently the state of the art for molecular generation and optimization. This study therefore investigates the characteristics of the four mainstream molecular representation languages within the same diffusion model for training generative molecular sets. First, each molecule is encoded in the four representations; a denoising diffusion model is then trained with identical hyperparameters for each. Subsequently, 30,000 molecules are generated for evaluation and analysis.
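The granularity difference among these string languages can be illustrated by how they tokenize. The sketch below is a simplified, stdlib-only illustration (not the paper's preprocessing pipeline, which would typically use RDKit and the `selfies` package): SMILES mixes single-character atoms, bond symbols, and ring-closure digits, whereas SELFIES symbols are self-delimiting bracketed tokens. The SELFIES string shown is hand-written for illustration and is not claimed to be the `selfies` encoder's exact output.

```python
import re

def tokenize_smiles(smiles):
    """Split a SMILES string into atom/bond/ring tokens (simplified grammar:
    two-letter halogens, bracket atoms, single letters, bonds, ring digits)."""
    pattern = r"Cl|Br|\[[^\]]+\]|[A-Za-z]|[=#\$/\\()\.\+\-%]|\d"
    return re.findall(pattern, smiles)

def tokenize_selfies(selfies):
    """SELFIES symbols are self-delimiting bracketed tokens, so tokenization
    is trivial and never ambiguous."""
    return re.findall(r"\[[^\]]+\]", selfies)

benzene_smiles = "c1ccccc1"
# Hand-written SELFIES-style string for illustration only.
benzene_selfies = "[C][=C][C][=C][C][=C][Ring1][=Branch1]"

print(tokenize_smiles(benzene_smiles))   # ['c', '1', 'c', 'c', 'c', 'c', 'c', '1']
print(tokenize_selfies(benzene_selfies))
```

The self-delimiting SELFIES vocabulary is one reason SELFIES guarantees syntactic validity of decoded strings, while a generative model emitting SMILES can produce unmatched ring digits or parentheses.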
The results indicate that the four molecular representation languages exhibit both similarities and differences in property and spatial distributions: SELFIES and SMARTS behave very similarly, whereas IUPAC and SMILES differ substantially. IUPAC's primary advantage lies in the novelty and diversity of the generated molecules; SMILES performs best on the QEPPI (quantitative estimate of protein-protein interaction targeting drug-likeness) and SAscore (synthetic accessibility score) metrics, while SELFIES and SMARTS perform best on QED (quantitative estimate of drug-likeness). These findings provide crucial guidance for selecting molecular representations in AI drug design tasks, thereby contributing to more efficient drug development.
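The novelty and diversity metrics mentioned above are conventionally defined as the fraction of generated molecules absent from the training set and one minus the mean pairwise Tanimoto similarity of molecular fingerprints. The sketch below shows these definitions with a toy character-bigram fingerprint standing in for the Morgan fingerprints a real evaluation (e.g. via RDKit) would use; the molecule strings are illustrative.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def bigram_fp(s):
    """Toy fingerprint: set of character bigrams (stand-in for Morgan FPs)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def novelty(generated, training):
    """Fraction of generated strings that do not appear in the training set."""
    train = set(training)
    return sum(g not in train for g in generated) / len(generated)

def diversity(generated):
    """One minus the mean pairwise Tanimoto similarity over the set."""
    fps = [bigram_fp(g) for g in generated]
    pairs = list(combinations(fps, 2))
    return 1.0 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

train = ["CCO", "CCN", "c1ccccc1"]          # illustrative training set
gen = ["CCO", "CCCO", "CCCl", "c1ccncc1"]   # illustrative generated set
print(round(novelty(gen, train), 2))        # 0.75 (only "CCO" is seen in training)
```

In practice novelty is computed on canonicalized structures rather than raw strings, so that two different SMILES for the same molecule are not counted as novel.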