Han Herim, Yeom Min Sun, Choi Sunghwan
NamuICT R&D Center, NamuICT, 41 Magok Jungang 8-ro, Seoul, 07793, Republic of Korea.
Department of Chemistry, Inha University, 100 Inha-ro, Michuhol-gu, Incheon, 22212, Republic of Korea.
Sci Rep. 2025 May 15;15(1):16892. doi: 10.1038/s41598-025-01890-7.
The Simplified Molecular Input Line Entry System (SMILES) is one of the most widely adopted molecular representations. However, SMILES notation suffers from limited token diversity, and its individual tokens carry little chemical information. To address these limitations while preserving the simplicity of SMILES, we propose a molecular representation that hybridizes standard SMILES tokens with Atom-In-SMILES (AIS) tokens, each of which encodes the local chemical environment of an atom in a single token. This hybrid representation, termed SMI + AIS, lets AIS tokens differentiate chemical elements by their chemical context without introducing additional tokens for less frequent elements. We evaluated SMI + AIS on chemical structure generation based on latent space optimization, comparing a predefined metric of the generated structures. Compared to standard SMILES, SMI + AIS achieved a 7% improvement in binding affinity and a 6% increase in synthesizability, highlighting its utility for machine learning-based molecular design. Our results demonstrate that the SMI + AIS representation captures chemical context more effectively and informatively, and that it has the potential to improve performance in other machine learning tasks in chemistry.
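The hybridization of SMILES and AIS tokens described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the AIS token for each atom has already been computed elsewhere (for example with a cheminformatics toolkit) and that a hybrid vocabulary keeps only a chosen set of AIS tokens, falling back to the plain SMILES token for every other atom. The function names `tokenize_smiles` and `hybridize`, the selection rule, and the AIS strings in the usage note are all hypothetical and serve only as illustration.

```python
import re

# A commonly used regular expression for splitting SMILES into tokens.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+"
    r"|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)
# Tokens treated as atoms in this sketch (the wildcard '*' is not handled).
ATOM_TOKEN = re.compile(r"^(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp])$")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize {smiles!r}")
    return tokens


def hybridize(smiles, ais_per_atom, ais_vocab):
    """Build a hybrid token sequence (hypothetical helper).

    ais_per_atom: precomputed AIS token for each atom, in SMILES order.
    ais_vocab:    AIS tokens allowed in the hybrid vocabulary (assumption:
                  any atom whose AIS token is absent keeps its SMILES token).
    """
    hybrid = []
    atom_index = 0
    for token in tokenize_smiles(smiles):
        if ATOM_TOKEN.match(token):
            if atom_index >= len(ais_per_atom):
                raise ValueError("fewer AIS tokens than atoms")
            ais = ais_per_atom[atom_index]
            hybrid.append(ais if ais in ais_vocab else token)
            atom_index += 1
        else:
            # Bonds, branches, and ring-closure digits pass through unchanged.
            hybrid.append(token)
    if atom_index != len(ais_per_atom):
        raise ValueError("more AIS tokens than atoms")
    return hybrid
```

For ethanol (`CCO`), with illustrative AIS strings `["[CH3;!R;C]", "[CH2;!R;CO]", "[OH;!R;C]"]` and a vocabulary containing only the first and last of them, `hybridize` returns `["[CH3;!R;C]", "C", "[OH;!R;C]"]`: two atoms carry their chemical context, and the middle carbon falls back to the ordinary SMILES token.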