Alhmoudi Obaid Khaleifah, Aboushanab Mahmoud, Thameem Muhammed, Elkamel Ali, AlHammadi Ali A
Department of Chemical & Petroleum Engineering, Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
Research and Innovation Center on CO2 and Hydrogen (RICH Center), Khalifa University of Science and Technology, P.O. Box 127788, Abu Dhabi, United Arab Emirates.
Sci Rep. 2025 Jul 2;15(1):23627. doi: 10.1038/s41598-025-05017-w.
Accurate molecular property prediction requires input representations that preserve substructural details and maintain syntactic consistency. SMILES (Simplified Molecular Input Line Entry System), while widely used, does not guarantee validity and allows multiple representations for the same compound. SELFIES (Self-Referencing Embedded Strings) addresses these limitations through a robust grammar that ensures structural validity. This study investigates whether a SMILES-pretrained transformer, ChemBERTa-zinc-base-v1, can be adapted to SELFIES using domain-adaptive pretraining without modifying the tokenizer or model architecture. Approximately 700,000 SELFIES-formatted molecules from PubChem were used for adaptation, which completed within 12 h on a single NVIDIA A100 GPU. Embedding-level evaluation included t-distributed stochastic neighbor embedding (t-SNE), cosine similarity, and regression on twelve QM9 properties using frozen transformer weights. The domain-adapted model outperformed the original SMILES baseline and slightly surpassed ChemBERTa-77M-MLM across most targets, despite a 100-fold difference in pretraining scale. For downstream evaluation, the model was fine-tuned end-to-end on ESOL, FreeSolv, and Lipophilicity, achieving root mean squared error (RMSE) values of 0.944, 2.511, and 0.746, respectively. These results demonstrate that SELFIES-based adaptation offers a cost-efficient alternative for molecular property prediction, without relying on molecular descriptors, 3D features, or large-scale infrastructure.
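
The following is a minimal sketch of the adaptation pipeline described in the abstract, not the authors' released code. It assumes the publicly available seyonec/ChemBERTa-zinc-base-v1 checkpoint on the Hugging Face Hub and the selfies, transformers, and datasets packages; the toy molecule list and all hyperparameters are illustrative placeholders, since the paper's exact training settings are not given here.

# Sketch: SMILES -> SELFIES conversion, then domain-adaptive MLM pretraining of the
# SMILES-pretrained ChemBERTa checkpoint on SELFIES strings, reusing its original
# tokenizer and architecture unchanged. Hyperparameters below are illustrative only.
import selfies as sf
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# 1) Convert SMILES to SELFIES (toy stand-in for the ~700,000 PubChem molecules)
smiles_list = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]
selfies_list = [sf.encoder(s) for s in smiles_list]   # e.g. "[C][C][O]"

# 2) Load the SMILES-pretrained checkpoint as published (tokenizer and weights untouched)
checkpoint = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# 3) Tokenize the SELFIES strings with the existing (SMILES-derived) tokenizer
dataset = Dataset.from_dict({"text": selfies_list})
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# 4) Domain-adaptive pretraining with the standard masked-language-model objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(output_dir="chemberta-selfies-dapt",
                         per_device_train_batch_size=64,
                         num_train_epochs=3,
                         learning_rate=5e-5,
                         logging_steps=100)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()

For the embedding-level evaluation described above, the adapted encoder can then be kept frozen: each SELFIES string is passed through the model, the last hidden states are pooled into a fixed-length vector, and a standard regressor is fit on those vectors for the QM9 targets.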