A comprehensive comparison of molecular feature representations for use in predictive modeling.

Author information

Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia; Jožef Stefan International Postgraduate School, Ljubljana, Slovenia.

The University of Auckland, School of Computer Science, Auckland, New Zealand.

Publication information

Comput Biol Med. 2021 Mar;130:104197. doi: 10.1016/j.compbiomed.2020.104197. Epub 2021 Jan 9.

DOI:10.1016/j.compbiomed.2020.104197

PMID:33429140

Abstract

Machine learning methods are commonly used for predicting molecular properties to accelerate material and drug design. An important part of this process is deciding how to represent the molecules. Typically, machine learning methods expect examples represented by vectors of values, and many methods for calculating molecular feature representations have been proposed. In this paper, we perform a comprehensive comparison of different molecular features, including traditional methods such as fingerprints and molecular descriptors, and recently proposed learnable representations based on neural networks. Feature representations are evaluated on 11 benchmark datasets, used for predicting properties and measures such as mutagenicity, melting points, activity, solubility, and IC50. Our experiments show that several molecular features work similarly well over all benchmark datasets. The features that stand out most are Spectrophores, which perform significantly worse than the other features on most datasets. Molecular descriptors from the PaDEL library seem very well suited for predicting physical properties of molecules. Despite their simplicity, MACCS fingerprints performed very well overall. The results show that learnable representations achieve competitive performance compared to expert-based representations. However, task-specific representations (graph convolutions and Weave methods) rarely offer any benefits, even though they are computationally more demanding. Lastly, combining different molecular feature representations typically does not give a noticeable improvement in performance compared to individual feature representations.
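To make the idea of "examples represented by vectors of values" concrete, the following is a minimal sketch in Python using only the standard library. It is not one of the representations evaluated in the paper: `hashed_ngram_fingerprint`, `character_counts`, and `combine` are hypothetical helpers, the fingerprint hashes raw SMILES character n-grams rather than chemical substructures, and combining representations by simple concatenation is an assumption, since the abstract does not say how features were combined.

```python
import hashlib


def hashed_ngram_fingerprint(smiles: str, n_bits: int = 64, max_n: int = 3) -> list[int]:
    """Toy fixed-length bit vector built from SMILES character n-grams.

    Assumption: this only mimics the *shape* of a hashed fingerprint.
    Real fingerprints (e.g. MACCS) encode chemical substructures.
    """
    bits = [0] * n_bits
    for n in range(1, max_n + 1):
        for i in range(len(smiles) - n + 1):
            fragment = smiles[i:i + n]
            digest = hashlib.sha256(fragment.encode("utf-8")).digest()
            # Map the fragment to a bit position; collisions are possible.
            index = int.from_bytes(digest[:4], "big") % n_bits
            bits[index] = 1
    return bits


def character_counts(smiles: str, symbols: tuple[str, ...] = ("C", "N", "O")) -> list[int]:
    """Toy descriptor vector: case-insensitive counts of selected characters.

    Assumption: a crude stand-in for molecular descriptors. It counts
    characters, not atoms, so "Cl" would also be counted as a "C".
    """
    upper = smiles.upper()
    return [upper.count(symbol) for symbol in symbols]


def combine(*feature_vectors: list[int]) -> list[int]:
    """Combine several representations by concatenation (an assumption)."""
    combined: list[int] = []
    for vector in feature_vectors:
        combined.extend(vector)
    return combined


ethanol = "CCO"
fingerprint = hashed_ngram_fingerprint(ethanol)
descriptors = character_counts(ethanol)
features = combine(fingerprint, descriptors)
print(len(features))  # 67 = 64 fingerprint bits + 3 counts
```

Any such fixed-length vector can be passed to a standard learner. In practice a cheminformatics toolkit would replace both toy functions.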

Similar articles

Molecular graph convolutions: moving beyond fingerprints.

J Comput Aided Mol Des. 2016 Aug;30(8):595-608. doi: 10.1007/s10822-016-9938-8. Epub 2016 Aug 24.

Evaluating molecular representations in machine learning models for drug response prediction and interpretability.

J Integr Bioinform. 2022 Aug 26;19(3). doi: 10.1515/jib-2022-0006. eCollection 2022 Sep 1.

Using molecular embeddings in QSAR modeling: does it make a difference?

Brief Bioinform. 2022 Jan 17;23(1). doi: 10.1093/bib/bbab365.

Impact of Chemist-In-The-Loop Molecular Representations on Machine Learning Outcomes.

J Chem Inf Model. 2020 Oct 26;60(10):4449-4456. doi: 10.1021/acs.jcim.0c00193. Epub 2020 Aug 18.

Predicting Energetics Materials' Crystalline Density from Chemical Structure by Machine Learning.

J Chem Inf Model. 2021 May 24;61(5):2147-2158. doi: 10.1021/acs.jcim.0c01318. Epub 2021 Apr 26.

Deep Learning Total Energies and Orbital Energies of Large Organic Molecules Using Hybridization of Molecular Fingerprints.

J Chem Inf Model. 2020 Dec 28;60(12):5971-5983. doi: 10.1021/acs.jcim.0c00687. Epub 2020 Oct 29.

A novel molecular representation with BiGRU neural networks for learning atom.

Brief Bioinform. 2020 Dec 1;21(6):2099-2111. doi: 10.1093/bib/bbz125.

Importance of Engineered and Learned Molecular Representations in Predicting Organic Reactivity, Selectivity, and Chemical Properties.

Acc Chem Res. 2021 Feb 16;54(4):827-836. doi: 10.1021/acs.accounts.0c00745. Epub 2021 Feb 3.

Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations.

Chem Sci. 2018 Nov 19;10(6):1692-1701. doi: 10.1039/c8sc04175j. eCollection 2019 Feb 14.

Cited by

Understanding Conformation Importance in Data-Driven Property Prediction Models.

J Chem Inf Model. 2025 Apr 14;65(7):3388-3404. doi: 10.1021/acs.jcim.5c00018. Epub 2025 Mar 18.

AISMPred: A Machine Learning Approach for Predicting Anti-Inflammatory Small Molecules.

Pharmaceuticals (Basel). 2024 Dec 15;17(12):1693. doi: 10.3390/ph17121693.

Sort & Slice: a simple and superior alternative to hash-based folding for extended-connectivity fingerprints.

J Cheminform. 2024 Dec 3;16(1):135. doi: 10.1186/s13321-024-00932-y.

Deciphering Molecular Embeddings with Centered Kernel Alignment.

J Chem Inf Model. 2024 Oct 14;64(19):7303-7312. doi: 10.1021/acs.jcim.4c00837. Epub 2024 Sep 25.

DiPPI: A Curated Data Set for Drug-like Molecules in Protein-Protein Interfaces.

J Chem Inf Model. 2024 Jul 8;64(13):5041-5051. doi: 10.1021/acs.jcim.3c01905. Epub 2024 Jun 22.

A Comprehensive Comparative Analysis of Deep Learning Based Feature Representations for Molecular Taste Prediction.

Foods. 2023 Sep 9;12(18):3386. doi: 10.3390/foods12183386.

Combatting over-specialization bias in growing chemical databases.

J Cheminform. 2023 May 19;15(1):53. doi: 10.1186/s13321-023-00716-w.

Exploring QSAR models for activity-cliff prediction.

J Cheminform. 2023 Apr 17;15(1):47. doi: 10.1186/s13321-023-00708-w.

Block-wise Exploration of Molecular Descriptors with Multi-block Orthogonal Component Analysis (MOCA).

Mol Inform. 2022 May;41(5):e2100165. doi: 10.1002/minf.202100165. Epub 2021 Dec 8.
