McCarthy Michael, Lee Kin Long Kelvin
Center for Astrophysics | Harvard & Smithsonian, 60 Garden Street, Cambridge, Massachusetts 02138, United States.
J Phys Chem A. 2020 Apr 16;124(15):3002-3017. doi: 10.1021/acs.jpca.0c01376. Epub 2020 Apr 7.
A proof-of-concept framework for identifying molecules of unknown elemental composition and structure using experimental rotational data and probabilistic deep learning is presented. Using a minimal set of input data determined experimentally, we describe four neural network architectures that yield information to assist in the identification of an unknown molecule. The first architecture translates spectroscopic parameters into Coulomb matrix eigenspectra as a method of recovering chemical and structural information encoded in the rotational spectrum. The eigenspectrum is subsequently used by three deep learning networks to constrain the range of stoichiometries, generate SMILES strings, and predict the most likely functional groups present in the molecule. In each model, we utilize dropout layers as an approximation to Bayesian sampling, which subsequently generates probabilistic predictions from otherwise deterministic models. These models are trained on a modestly sized theoretical dataset comprising ∼83 000 unique organic molecules (between 18 and 180 amu) optimized at the ωB97X-D/6-31+G(d) level of theory, where the theoretical uncertainties of the spectroscopic constants are well-understood and used to further augment training. Since chemical and structural properties depend strongly on molecular composition, we divided the dataset into four groups corresponding to pure hydrocarbons, oxygen-bearing species, nitrogen-bearing species, and both oxygen- and nitrogen-bearing species, training each type of network with one of these categories, thus creating "experts" within each domain of molecules. We demonstrate how these models can then be used for practical inference on four molecules and discuss both the strengths and shortcomings of our approach and the future directions these architectures can take.
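The use of dropout layers as an approximation to Bayesian sampling is commonly called Monte Carlo dropout: dropout stays active at inference time, and repeated stochastic forward passes give a distribution of predictions rather than a single value. The sketch below illustrates the idea in plain Python for a toy one-hidden-layer regression network. It is not the architecture used in this work; the function name `mc_dropout_predict`, the weights, the input vector, and the dropout rate are all hypothetical placeholders chosen for illustration.

```python
import random
import statistics


def mc_dropout_predict(x, w_hidden, w_out, p_drop=0.5, n_samples=1000, seed=0):
    """Monte Carlo dropout on a toy one-hidden-layer network (illustrative only).

    x        : list of input features
    w_hidden : list of weight rows, one row per hidden unit (hypothetical values)
    w_out    : list of output weights, one per hidden unit (hypothetical values)
    Returns (mean, standard deviation, list of sampled predictions).
    """
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must satisfy 0 <= p_drop < 1")
    rng = random.Random(seed)  # local generator so results are reproducible
    keep = 1.0 - p_drop

    # Deterministic part: hidden activations with a ReLU nonlinearity.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]

    samples = []
    for _ in range(n_samples):
        # Dropout remains ON at inference: each hidden unit is kept with
        # probability `keep` and rescaled by 1/keep (inverted dropout).
        masked = [h / keep if rng.random() < keep else 0.0 for h in hidden]
        samples.append(sum(w * h for w, h in zip(w_out, masked)))

    return statistics.mean(samples), statistics.stdev(samples), samples


# Hypothetical toy inputs and weights, not values from the paper.
X = [1.0, 0.5, -0.2]
W_HIDDEN = [
    [0.4, -0.1, 0.3],
    [0.2, 0.5, -0.6],
    [-0.3, 0.8, 0.1],
    [0.7, 0.2, 0.4],
]
W_OUT = [0.5, -0.4, 0.9, 0.3]

mc_mean, mc_std, mc_samples = mc_dropout_predict(X, W_HIDDEN, W_OUT)
det_mean, det_std, _ = mc_dropout_predict(X, W_HIDDEN, W_OUT, p_drop=0.0)

print(f"MC dropout prediction: {mc_mean:.3f} +/- {mc_std:.3f}")
print(f"Deterministic prediction (no dropout): {det_mean:.3f}")
```

With `p_drop=0.0` every forward pass is identical, so the spread collapses to zero and the deterministic output is recovered. With dropout enabled, the standard deviation of the samples serves as a rough uncertainty estimate on the prediction.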