Going beyond SMILES enumeration for data augmentation in generative drug discovery.

Authors

Brinkmann Helena, Argante Antoine, Ter Steege Hugo, Grisoni Francesca

Affiliations

Institute for Complex Molecular Systems (ICMS), Eindhoven AI Systems Institute (EAISI), Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands.

Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Utrecht, The Netherlands.

Publication information

Digit Discov. 2025 Aug 14. doi: 10.1039/d5dd00028a.

Abstract

Data augmentation can alleviate the limitations of small molecular datasets for generative deep learning by 'artificially inflating' the number of instances available for training. SMILES enumeration - wherein multiple valid SMILES strings are used to represent the same molecule - has proven particularly beneficial for improving the quality of molecule design. Herein, we investigated whether rethinking SMILES augmentation techniques could further enhance the quality of design. To this end, we introduce four novel approaches for SMILES augmentation, drawing inspiration from natural language processing and chemistry insights: (a) token deletion, (b) atom masking, (c) bioisosteric substitution, and (d) self-training. Through a systematic analysis, our results showed the promise of considering additional strategies for SMILES augmentation. Every strategy showed distinct advantages; for example, atom masking is particularly promising for learning desirable physico-chemical properties in very low-data regimes, and token deletion for creating novel scaffolds. This new repertoire of SMILES augmentation strategies expands the available toolkit to design molecules with bespoke properties in low-data scenarios.
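
To make two of the strategies named in the abstract more concrete, the following is a minimal Python sketch of token deletion and atom masking on a tokenized SMILES string. It is an illustration under stated assumptions and not the authors' implementation: the tokenizer regex, the function names (`tokenize`, `delete_tokens`, `mask_atoms`), the per-token probability `p`, and the choice of `*` as the mask symbol are all hypothetical, since the abstract does not specify them.

```python
import random
import re

# Assumed tokenizer: bracket atoms, two-letter halogens, organic-subset atoms,
# aromatic atoms, ring closures, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[()=#\-+\\/:~@?>*$.])"
)

# Unbracketed tokens that stand for atoms in this sketch.
ATOM_TOKENS = {"B", "C", "N", "O", "S", "P", "F", "I", "Br", "Cl",
               "b", "c", "n", "o", "s", "p"}


def tokenize(smiles):
    """Split a SMILES string into tokens; fail on characters not covered."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


def is_atom(token):
    """True for bracket atoms such as [NH4+] and for organic-subset atoms."""
    return token.startswith("[") or token in ATOM_TOKENS


def delete_tokens(tokens, p, rng):
    """Token deletion: drop each token independently with probability p.

    The result is not guaranteed to be a valid SMILES string.
    """
    return [t for t in tokens if rng.random() >= p]


def mask_atoms(tokens, p, rng, mask="*"):
    """Atom masking: replace each atom token with `mask` with probability p.

    Bonds, branches, and ring closures are left untouched, so the length
    of the token sequence is preserved.
    """
    return [mask if is_atom(t) and rng.random() < p else t for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    print("".join(delete_tokens(tokens, 0.15, rng)))
    print("".join(mask_atoms(tokens, 0.15, rng)))
```

Strings produced by deletion may be syntactically invalid; whether such strings are kept or filtered during training is a design decision this sketch leaves open.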

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0b2d/12409607/3f516126bad9/d5dd00028a-f1.jpg