Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error.

Authors

Faber Felix A, Hutchison Luke, Huang Bing, Gilmer Justin, Schoenholz Samuel S, Dahl George E, Vinyals Oriol, Kearnes Steven, Riley Patrick F, von Lilienfeld O Anatole

Affiliations

Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials, Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland.

Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, United States.

Publication information

J Chem Theory Comput. 2017 Nov 14;13(11):5255-5264. doi: 10.1021/acs.jctc.7b00577. Epub 2017 Oct 10.

DOI:10.1021/acs.jctc.7b00577

PMID:28926232

Abstract

We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of 13 electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to ∼118k distinct molecules. Molecular structures and properties at the hybrid density functional theory (DFT) level of theory come from the QM9 database [Ramakrishnan et al. Sci. Data 2014, 1, 140022] and include enthalpies and free energies of atomization, HOMO/LUMO energies and gap, dipole moment, polarizability, zero point vibrational energy, heat capacity, and the highest fundamental vibrational frequency. Various molecular representations have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR), and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). Out-of-sample errors are strongly dependent on the choice of representation, regressor, and molecular property. Electronic properties are typically best accounted for by MG and GC, while energetic properties are better described by HDAD and KRR. The specific combinations with the lowest out-of-sample errors in the ∼118k training set size limit are (free) energies and enthalpies of atomization (HDAD/KRR), HOMO/LUMO eigenvalue and gap (MG/GC), dipole moment (MG/GC), static polarizability (MG/GG), zero point vibrational energy (HDAD/KRR), heat capacity at room temperature (HDAD/KRR), and highest fundamental vibrational frequency (BAML/RF). We present numerical evidence that ML model predictions deviate from DFT (B3LYP) less than DFT (B3LYP) deviates from experiment for all properties. Furthermore, out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. The results suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data were available.
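The learning-curve protocol described in the abstract can be illustrated with a minimal sketch. It pairs one representation named above (the Coulomb matrix, in its commonly used form with 0.5·Z^2.4 on the diagonal) with one regressor (KRR, here with a Laplacian kernel) and reports out-of-sample mean absolute error as a function of training set size. Everything else is an assumption made for illustration: the data are a synthetic toy set of H2-like diatomics with a Morse-like target rather than QM9, and the function names, kernel width `sigma`, and regularization `lam` are hypothetical choices, not the settings used in the paper.

```python
import numpy as np

def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5*Z_i^2.4 on the diagonal, Z_i*Z_j/|R_i-R_j| elsewhere."""
    z = np.asarray(charges, dtype=float)
    r = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder so the division below is safe
    cm = np.outer(z, z) / dist
    np.fill_diagonal(cm, 0.5 * z ** 2.4)
    return cm

def laplacian_kernel(a, b, sigma):
    """k(x, y) = exp(-||x - y||_1 / sigma) for all row pairs of a and b."""
    d1 = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=-1)
    return np.exp(-d1 / sigma)

def krr_fit_predict(x_train, y_train, x_test, sigma=1.0, lam=1e-6):
    """Kernel ridge regression: solve (K + lam*I) alpha = y, then predict."""
    k = laplacian_kernel(x_train, x_train, sigma)
    alpha = np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)
    return laplacian_kernel(x_test, x_train, sigma) @ alpha

def learning_curve(x, y, train_sizes, n_test, seed=0):
    """Out-of-sample MAE on a fixed held-out set for each training set size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    test, pool = order[:n_test], order[n_test:]
    maes = []
    for n in train_sizes:
        train = pool[:n]
        pred = krr_fit_predict(x[train], y[train], x[test])
        maes.append(float(np.mean(np.abs(pred - y[test]))))
    return maes

def toy_dataset(n, seed=0):
    """Synthetic stand-in for real data (assumption): diatomics with a Morse-like target."""
    rng = np.random.default_rng(seed)
    d = rng.uniform(0.5, 3.0, size=n)  # bond lengths, arbitrary units
    iu = np.triu_indices(2)            # upper triangle as a flat feature vector
    x = np.array([coulomb_matrix([1, 1], [[0, 0, 0], [0, 0, di]])[iu] for di in d])
    y = (1.0 - np.exp(-(d - 1.0))) ** 2 - 1.0
    return x, y

if __name__ == "__main__":
    x, y = toy_dataset(600)
    sizes = [8, 32, 128, 512]
    for n, mae in zip(sizes, learning_curve(x, y, sizes, n_test=88)):
        print(f"N = {n:4d}  MAE = {mae:.4f}")
```

Learning curves of this kind are usually plotted on log-log axes, so that the decay of error with training set size can be compared across representation/regressor pairs. On the toy data the numbers carry no chemical meaning; only the protocol is being shown.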

Similar articles

Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error.

J Chem Theory Comput. 2017 Nov 14;13(11):5255-5264. doi: 10.1021/acs.jctc.7b00577. Epub 2017 Oct 10.

Alchemical and structural distribution based representation for universal quantum machine learning.

J Chem Phys. 2018 Jun 28;148(24):241717. doi: 10.1063/1.5020710.

Communication: Understanding molecular representations in machine learning: The role of uniqueness and target similarity.

J Chem Phys. 2016 Oct 28;145(16):161102. doi: 10.1063/1.4964627.

Comparison Study on the Prediction of Multiple Molecular Properties by Various Neural Networks.

J Phys Chem A. 2018 Nov 21;122(46):9128-9134. doi: 10.1021/acs.jpca.8b09376. Epub 2018 Nov 13.

Chemical diversity in molecular orbital energy predictions with kernel ridge regression.

J Chem Phys. 2019 May 28;150(20):204121. doi: 10.1063/1.5086105.

Machine Learning Prediction of Nine Molecular Properties Based on the SMILES Representation of the QM9 Quantum-Chemistry Dataset.

J Phys Chem A. 2020 Nov 25;124(47):9854-9866. doi: 10.1021/acs.jpca.0c05969. Epub 2020 Nov 11.

Comparison of DFT methods for molecular orbital eigenvalue calculations.

J Phys Chem A. 2007 Mar 1;111(8):1554-61. doi: 10.1021/jp061633o. Epub 2007 Feb 6.

Kernel based quantum machine learning at record rate: Many-body distribution functionals as compact representations.

J Chem Phys. 2023 Jul 21;159(3). doi: 10.1063/5.0152215.

Many Molecular Properties from One Kernel in Chemical Space.

Chimia (Aarau). 2015;69(4):182-6. doi: 10.2533/chimia.2015.182.

Machine Learning Methods to Predict Density Functional Theory B3LYP Energies of HOMO and LUMO Orbitals.

J Chem Inf Model. 2017 Jan 23;57(1):11-21. doi: 10.1021/acs.jcim.6b00340. Epub 2016 Dec 29.

Cited by

Predictive design of crystallographic chiral separation.

Nat Commun. 2025 Aug 26;16(1):7977. doi: 10.1038/s41467-025-62825-4.

Enhancing Machine Learning Potentials through Transfer Learning across Chemical Elements.

J Chem Inf Model. 2025 Jul 28;65(14):7406-7414. doi: 10.1021/acs.jcim.5c00293. Epub 2025 Jul 7.

RGBChem: Image-Like Representation of Chemical Compounds for Property Prediction.

J Chem Theory Comput. 2025 May 27;21(10):5322-5333. doi: 10.1021/acs.jctc.5c00291. Epub 2025 May 12.

Trustworthy Inverse Molecular Design via Alignment with Molecular Dynamics.

Adv Sci (Weinh). 2025 Jul;12(27):e2416356. doi: 10.1002/advs.202416356. Epub 2025 May 8.

Machine Learning Classification of Chirality and Optical Rotation Using a Simple One-Hot Encoded Cartesian Coordinate Molecular Representation.

J Chem Inf Model. 2025 May 12;65(9):4281-4292. doi: 10.1021/acs.jcim.4c02374. Epub 2025 May 1.

Machine learning modeling of electronic spectra and thermodynamic stability for a comprehensive chemical space of melanin.

Chem Sci. 2025 Apr 22;16(21):9230-9. doi: 10.1039/d5sc00046g.

Machine learning applications for thermochemical and kinetic property prediction.

Rev Chem Eng. 2024 Nov 29;41(4):419-449. doi: 10.1515/revce-2024-0027. eCollection 2025 May.

CoDNet: controlled diffusion network for structure-based drug design.

Bioinform Adv. 2025 Feb 19;5(1):vbaf031. doi: 10.1093/bioadv/vbaf031. eCollection 2025.

Adapting hybrid density functionals with machine learning.

Sci Adv. 2025 Jan 31;11(5):eadt7769. doi: 10.1126/sciadv.adt7769.

Human interpretable structure-property relationships in chemistry using explainable machine learning and large language models.

Commun Chem. 2025 Jan 14;8(1):11. doi: 10.1038/s42004-024-01393-y.
