A hitchhiker's guide to deep chemical language processing for bioactivity prediction.

Author information

Özçelik Rıza, Grisoni Francesca

Affiliations

Eindhoven University of Technology, Institute for Complex Molecular Systems, Eindhoven AI Systems Institute, Dept. of Biomedical Engineering, Eindhoven, Netherlands.

Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Netherlands.

Publication information

Digit Discov. 2024 Dec 16;4(2):316-325. doi: 10.1039/d4dd00311j. eCollection 2025 Feb 12.

DOI:10.1039/d4dd00311j

PMID:39726698

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11667676/

Abstract

Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP approaches learn from molecular string representations (e.g., Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, training predictive CLP models is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP and provide guidelines for newcomers and experts. Our study spans three neural network architectures, two string representations, and three embedding strategies, across ten bioactivity datasets, for both classification and regression purposes. This 'hitchhiker's guide' not only underscores the importance of certain methodological decisions, but also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
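To make the idea of learning from molecular strings concrete, the following is a minimal sketch of the preprocessing step that typically comes before any CLP model: splitting a SMILES string into tokens and mapping them to integer indices that an embedding layer could consume. It is an illustration only, not the authors' pipeline. The regular expression, the `<pad>` token, and the helper names `tokenize`, `build_vocab`, and `encode` are assumptions made for this example, and the pattern covers only common SMILES syntax.

```python
import re

# Assumed, simplified SMILES token pattern (illustrative, not exhaustive):
# bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, and bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")


def tokenize(smiles):
    """Split a SMILES string into tokens; fail loudly on unsupported characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Assign an integer index to each distinct token; index 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Map tokens to indices, then truncate or right-pad to a fixed length."""
    ids = [vocab[token] for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


if __name__ == "__main__":
    molecules = ["CCO", "ClCBr", "CC(=O)Oc1ccccc1C(=O)O"]
    token_lists = [tokenize(s) for s in molecules]
    vocab = build_vocab(token_lists)
    for smiles, tokens in zip(molecules, token_lists):
        print(smiles, encode(tokens, vocab, max_len=24))
```

The resulting index sequences are what a one-hot or learnable embedding would operate on. A SELFIES-based variant would follow the same pattern with a different tokenizer.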

Figure 1 (image): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/bed0/11667676/d04ef4778857/d4dd00311j-f1.jpg
