Heidelberg Institute for Theoretical Studies (HITS gGmbH), 69118 Heidelberg, Germany.
Interdisciplinary Center for Scientific Computing, Heidelberg University, 69120 Heidelberg, Germany.
J Chem Phys. 2023 Jun 7;158(21). doi: 10.1063/5.0151122.
Chemical (molecular, quantum) machine learning relies on representing molecules in unique and informative ways. Here, we present the matrix of orthogonalized atomic orbital coefficients (MAOC) as a quantum-inspired molecular and atomic representation containing both structural (composition and geometry) and electronic (charge and spin multiplicity) information. MAOC is based on a cost-effective localization scheme that represents localized orbitals via a predefined set of atomic orbitals. The latter can be constructed from small atom-centered basis sets, such as pcseg-0 and STO-3G, in conjunction with a guess (non-optimized) electronic configuration of the molecule. Importantly, MAOC is suitable for representing monatomic, molecular, and periodic systems and can distinguish compounds with identical compositions and geometries but distinct charges and spin multiplicities. Using principal component analysis, we constructed a more compact but equally powerful version of MAOC, PCX-MAOC. To test the performance of full and reduced MAOC and several other representations (CM, SOAP, SLATM, and SPAHM), we used a kernel ridge regression machine learning model to predict frontier molecular orbital energy levels and ground state single-point energies for chemically diverse neutral and charged, closed- and open-shell molecules from an extended QM7b dataset, as well as two new datasets, N-HPC-1 (N-heteropolycycles) and REDOX (nitroxyl and phenoxyl radicals, carbonyl, and cyano compounds). MAOC affords accuracy that is either similar or superior to that of other representations for a range of chemical properties and systems.
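The workflow named in the abstract (a matrix representation, reduced by principal component analysis and fed to kernel ridge regression) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the random matrices below are hypothetical stand-ins for flattened MAOC matrices, the target is synthetic, and the Laplacian kernel, the number of components, and the values of `gamma` and `alpha` are assumptions chosen only for illustration.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal axes (as rows)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are orthonormal principal axes.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def pca_transform(X, mean, axes):
    """Project data onto the principal axes fitted on the training set."""
    return (X - mean) @ axes.T

def laplacian_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||_1) for all pairs of rows."""
    dist = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)
    return np.exp(-gamma * dist)

def krr_fit(K, y, alpha):
    """Solve (K + alpha * I) c = y for the regression coefficients c."""
    return np.linalg.solve(K + alpha * np.eye(len(K)), y)

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 "molecules", each a flattened 12 x 12 matrix.
n_samples, n_side = 200, 12
X = rng.normal(size=(n_samples, n_side * n_side))
# Synthetic target property (an assumption, not a real chemical quantity).
w = rng.normal(size=X.shape[1])
y = np.tanh(X @ w / n_side)

# Simple train/test split: first 150 samples for training, last 50 for testing.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Reduce the representation with PCA fitted on the training set only.
mean, axes = pca_fit(X_train, n_components=20)
Z_train = pca_transform(X_train, mean, axes)
Z_test = pca_transform(X_test, mean, axes)

# Kernel ridge regression on the reduced features.
gamma, alpha = 1e-2, 1e-3
coef = krr_fit(laplacian_kernel(Z_train, Z_train, gamma), y_train, alpha)
y_pred = laplacian_kernel(Z_test, Z_train, gamma) @ coef

mae = float(np.mean(np.abs(y_pred - y_test)))
print(f"test MAE on synthetic data: {mae:.3f}")
```

In practice, the random matrices would be replaced by actual representations padded to a common size, and `gamma`, `alpha`, and the number of components would be chosen by cross-validation.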