Prediction Errors of Molecular Machine Learning Models Lower than Hybrid DFT Error.

Author information

Faber Felix A, Hutchison Luke, Huang Bing, Gilmer Justin, Schoenholz Samuel S, Dahl George E, Vinyals Oriol, Kearnes Steven, Riley Patrick F, von Lilienfeld O Anatole

Affiliations

Institute of Physical Chemistry and National Center for Computational Design and Discovery of Novel Materials, Department of Chemistry, University of Basel, Klingelbergstrasse 80, CH-4056 Basel, Switzerland.

Google, 1600 Amphitheatre Parkway, Mountain View, California 94043, United States.

Publication information

J Chem Theory Comput. 2017 Nov 14;13(11):5255-5264. doi: 10.1021/acs.jctc.7b00577. Epub 2017 Oct 10.

Abstract

We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of 13 electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to ∼118k distinct molecules. Molecular structures and properties at the hybrid density functional theory (DFT) level of theory come from the QM9 database [Ramakrishnan et al. Sci. Data 2014, 1, 140022] and include enthalpies and free energies of atomization, HOMO/LUMO energies and gap, dipole moment, polarizability, zero point vibrational energy, heat capacity, and the highest fundamental vibrational frequency. Various molecular representations have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution-based variants including histograms of distances (HD), angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR), and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). Out-of-sample errors are strongly dependent on the choice of representation, regressor, and molecular property. Electronic properties are typically best accounted for by MG and GC, while energetic properties are better described by HDAD and KRR. The specific combinations with the lowest out-of-sample errors in the ∼118k training set size limit are (free) energies and enthalpies of atomization (HDAD/KRR), HOMO/LUMO eigenvalue and gap (MG/GC), dipole moment (MG/GC), static polarizability (MG/GG), zero point vibrational energy (HDAD/KRR), heat capacity at room temperature (HDAD/KRR), and highest fundamental vibrational frequency (BAML/RF).
We present numerical evidence that ML model predictions deviate from DFT (B3LYP) less than DFT (B3LYP) deviates from experiment for all properties. Furthermore, out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. The results suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data were available.
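
The learning-curve protocol mentioned in the abstract (out-of-sample error as a function of training set size) can be illustrated with a minimal sketch. This is not the authors' code or data: it uses a synthetic one-dimensional target instead of QM9, a plain NumPy kernel ridge regression with a Gaussian kernel, and arbitrarily chosen hyperparameters (`sigma`, `lam`). All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, sigma, lam):
    # Solve (K + lam * I) alpha = y for the regression weights alpha.
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, sigma):
    # Prediction is a kernel-weighted sum over the training points.
    return gaussian_kernel(X_new, X_train, sigma) @ alpha

def learning_curve(X, y, X_test, y_test, sizes, sigma=0.5, lam=1e-6):
    # Train on the first n samples for each n and record the
    # mean absolute error on a fixed held-out test set.
    maes = []
    for n in sizes:
        alpha = krr_fit(X[:n], y[:n], sigma, lam)
        pred = krr_predict(X[:n], alpha, X_test, sigma)
        maes.append(float(np.mean(np.abs(pred - y_test))))
    return maes

# Synthetic stand-in for (representation, property) pairs; an assumption,
# not molecular data.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(400, 1))
y = np.sin(2.0 * X[:, 0])
X_test = rng.uniform(-3.0, 3.0, size=(200, 1))
y_test = np.sin(2.0 * X_test[:, 0])

sizes = [10, 40, 160, 400]
maes = learning_curve(X, y, X_test, y_test, sizes)
for n, mae in zip(sizes, maes):
    print(f"N = {n:4d}  out-of-sample MAE = {mae:.2e}")
```

Plotting `maes` against `sizes` on log-log axes gives the usual learning-curve view. In a real study the inputs would be molecular representations such as those named in the abstract, and the hyperparameters would be selected by cross-validation rather than fixed by hand.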
