Özçelik Rıza, Grisoni Francesca
Eindhoven University of Technology, Institute for Complex Molecular Systems, Eindhoven AI Systems Institute, Dept. Biomedical Engineering, Eindhoven, Netherlands.
Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Netherlands.
Digit Discov. 2024 Dec 16;4(2):316-325. doi: 10.1039/d4dd00311j. eCollection 2025 Feb 12.
Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP approaches learn from molecular string representations (e.g., Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, training predictive CLP models is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP and provide guidelines for newcomers and experts. Our study spans three neural network architectures, two string representations, and three embedding strategies across ten bioactivity datasets, for both classification and regression purposes. This 'hitchhiker's guide' not only underscores the importance of certain methodological decisions, but also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
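To make concrete what "learning from molecular string representations" involves, the following is a minimal sketch of how a SMILES string might be split into tokens and converted to integer indices or one-hot vectors before being passed to a neural network. It is illustrative only and is not the authors' pipeline: the regular expression is a simplified, hypothetical token pattern, and the names `tokenize`, `build_vocab`, `encode`, and `one_hot` are assumptions introduced here.

```python
import re

# Simplified, hypothetical SMILES token pattern (an assumption, not the
# paper's tokenizer): bracket atoms, two-letter halogens, two-digit ring
# closures, then single letters, digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]"
)


def tokenize(smiles):
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # Refuse strings containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Map each distinct token to an integer index; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to indices, truncating or right-padding to max_len."""
    ids = [vocab[token] for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def one_hot(ids, vocab_size):
    """Expand integer indices into one-hot vectors (one row per token)."""
    return [[1 if i == j else 0 for j in range(vocab_size)] for i in ids]
```

As a usage example, `tokenize("ClCC[NH3+]")` keeps `Cl` and the bracket atom `[NH3+]` as single tokens rather than splitting them into characters. The integer indices returned by `encode` could feed a learnable embedding layer, while `one_hot` gives a fixed encoding; which strategy works best is the kind of choice the study evaluates.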