Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations.

Author information

Winter Robin, Montanari Floriane, Noé Frank, Clevert Djork-Arné

Affiliations

Department of Bioinformatics, Bayer AG, Berlin, Germany.

Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany.

Publication information

Chem Sci. 2018 Nov 19;10(6):1692-1701. doi: 10.1039/c8sc04175j. eCollection 2019 Feb 14.

DOI:10.1039/c8sc04175j

PMID:30842833

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6368215/

Abstract

There has been a recent surge of interest in using machine learning across chemical space in order to predict properties of molecules or design molecules and materials with the desired properties. Most of this work relies on defining clever feature representations, in which the chemical graph structure is encoded in a uniform way such that predictions across chemical space can be made. In this work, we propose to exploit the powerful ability of deep neural networks to learn a feature representation from low-level encodings of a huge corpus of chemical structures. Our model borrows ideas from neural machine translation: it translates between two semantically equivalent but syntactically different representations of molecular structures, compressing the meaningful information both representations have in common in a low-dimensional representation vector. Once the model is trained, this representation can be extracted for any new molecule and utilized as a descriptor. In fair benchmarks with respect to various human-engineered molecular fingerprints and graph-convolution models, our method shows competitive performance in modelling quantitative structure-activity relationships in all analysed datasets. Additionally, we show that our descriptor significantly outperforms all baseline molecular fingerprints in two ligand-based virtual screening tasks. Overall, our descriptors show the most consistent performances in all experiments. The continuity of the descriptor space and the existence of the decoder that permits deducing a chemical structure from an embedding vector allow for exploration of the space and open up new opportunities for compound optimization and idea generation.
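To make the idea of "low-level encodings" of "two semantically equivalent but syntactically different representations" concrete, the following is a minimal sketch, not the authors' implementation. It assumes SMILES strings as the representation (the abstract does not name one), and the regex tokenizer and the helper names `tokenize`, `build_vocab`, and `one_hot` are hypothetical. It one-hot encodes two valid SMILES strings for ethanol, the kind of source/target pair a translation model could be trained on.

```python
import re

# Hypothetical tokenizer (an assumption, not taken from the paper):
# bracket atoms first, then two-letter halogens, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the corpus to an integer index."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}


def one_hot(smiles, vocab):
    """Encode a SMILES string as a list of one-hot rows, one row per token."""
    rows = []
    for tok in tokenize(smiles):
        row = [0] * len(vocab)
        row[vocab[tok]] = 1
        rows.append(row)
    return rows


# Two different but valid SMILES strings for ethanol:
# the same molecule written with different syntax.
source, target = "OCC", "CCO"
vocab = build_vocab([source, target])
encoded_source = one_hot(source, vocab)  # would be the encoder input
encoded_target = one_hot(target, vocab)  # would be the decoder target
```

The two encoded sequences differ even though they describe the same molecule. A model trained to map one onto the other has to capture what the two have in common, which is the information the abstract says ends up in the low-dimensional representation vector.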

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/88bc/6368215/1eb062446e79/c8sc04175j-f1.jpg
