Mazouin Bernard, Schöpfer Alexandre Alain, von Lilienfeld O Anatole
University of Vienna, Faculty of Physics and Vienna Doctoral School in Physics, Kolingasse 14-16, 1090 Vienna, Austria.
Department of Chemistry, University of Basel, Klingelbergstrasse 70, 4056 Basel, Switzerland.
Mater Adv. 2022 Sep 20;3(22):8306-8316. doi: 10.1039/d2ma00742h. eCollection 2022 Nov 14.
Despite their relevance for organic electronics, quantum machine learning (QML) models of molecular electronic properties, such as HOMO-LUMO gaps, often struggle to achieve satisfactory data efficiency, as measured by how quickly prediction errors decrease with increasing training set size. We demonstrate that partitioning training sets into different chemical classes prior to training yields independently trained QML models with reduced overall training data needs. For organic molecules drawn from the previously published QM7 and QM9 data sets, we have identified and exploited three relevant classes corresponding to compounds containing either aromatic rings and carbonyl groups, or single unsaturated bonds, or saturated bonds. The selected QML models of band gaps (considered at GW and hybrid DFT levels of theory) reach mean absolute prediction errors of ∼0.1 eV with up to an order of magnitude fewer training molecules than QML models trained on randomly selected molecules. Comparison to Δ-QML models of band gaps indicates that selected QML models exhibit superior data efficiency. Our findings suggest that selected QML, based on simple classifications prior to training, could help to tackle challenging quantum property screening tasks on large libraries with high fidelity and low computational burden.
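The partition-then-train idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each molecule already carries a precomputed class label and a numeric feature vector, and it uses a plain Gaussian-kernel ridge regression as a stand-in model. The function names (`fit_selected`, `predict_selected`), the class label strings, the hyperparameters `sigma` and `lam`, and the toy data are all hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise squared Euclidean distances between rows of A and rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_krr(X, y, sigma=1.0, lam=1e-8):
    # Kernel ridge regression: solve (K + lam * I) alpha = y.
    K = gaussian_kernel(X, X, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)
    return {"X": X, "alpha": alpha, "sigma": sigma}

def predict_krr(model, Xq):
    return gaussian_kernel(Xq, model["X"], model["sigma"]) @ model["alpha"]

def fit_selected(X, y, labels, **kw):
    # Train one independent model per chemical class (assumed precomputed labels).
    models = {}
    for c in np.unique(labels):
        mask = labels == c
        models[str(c)] = fit_krr(X[mask], y[mask], **kw)
    return models

def predict_selected(models, Xq, labels_q):
    # Route each query molecule to the model of its own class.
    # Queries whose class has no trained model are left as NaN.
    out = np.full(len(Xq), np.nan)
    for c, model in models.items():
        mask = labels_q == c
        if mask.any():
            out[mask] = predict_krr(model, Xq[mask])
    return out

# Toy data (placeholder features and gaps, not taken from QM7 or QM9).
X = np.arange(12, dtype=float).reshape(-1, 1)
labels = np.array(["aromatic_carbonyl", "unsaturated", "saturated"] * 4)
offsets = {"aromatic_carbonyl": 4.0, "unsaturated": 6.0, "saturated": 9.0}
y = np.array([offsets[c] for c in labels]) + 0.1 * X[:, 0]

models = fit_selected(X, y, labels, sigma=0.5, lam=1e-8)
pred = predict_selected(models, X, labels)
mae = np.abs(pred - y).mean()  # training-set mean absolute error of the toy fit
```

In practice one would evaluate the mean absolute error on held-out molecules at several training set sizes and compare it with a single model trained on randomly selected molecules, which is the comparison the abstract reports.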