El-Samman Amer Marwan, De Baerdemacker Stijn
University of New Brunswick, Department of Chemistry. 30 Dineen Dr, Fredericton, Canada
University of New Brunswick, Department of Mathematics and Statistics. 30 Dineen Dr, Fredericton, Canada
Chem Sci. 2025 Apr 23. doi: 10.1039/d4sc05655h.
In deep learning methods, especially in the context of chemistry, there is an increasing urgency to uncover the hidden learning mechanisms often dubbed the "black box." In this work, we show that graph models built on computational chemical data behave similarly to natural language processing (NLP) models built on text data. Crucially, we show that atom-embeddings, i.e. atom-parsed graph neural activation patterns, exhibit arithmetic properties that represent valid reaction formulas. This is closely analogous to how word-embeddings can be combined to form word analogies that preserve the semantic meaning behind the words, as in the famous example "King" - "Man" + "Woman" = "Queen." For instance, we show how the reaction from an alcohol to a carbonyl is represented by a constant vector in the embedding space, implicitly representing "-H." This vector is independent of the particular alcohol reactant and carbonyl product and represents a consistent chemical transformation. Other directions in the embedding space correspond to distinct chemical changes (e.g. the tautomerization direction). In contrast to natural language processing, we can explain the observed chemical analogies using algebraic manipulations on the local chemical composition that surrounds each atom-embedding. Furthermore, the observations find applications in transfer learning, for instance in the prediction of atomistic properties such as ¹H-NMR and ¹³C-NMR chemical shifts. This work is in line with the recent push for interpretable explanations of graph neural network models of chemistry and uncovers a latent model of chemistry that is highly structured, consistent, and analogous to chemical syntax.
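The embedding-arithmetic claim can be illustrated with a minimal sketch. The code below is not the authors' implementation; it uses synthetic random vectors as hypothetical stand-ins for per-atom GNN embeddings, and simply demonstrates the test one would run on real embeddings: compute the product-minus-reactant difference vector for several alcohol/carbonyl pairs, check that the differences align with a single constant direction, and perform a "King - Man + Woman = Queen"-style nearest-neighbour completion.

```python
# Minimal sketch of the embedding-arithmetic test, assuming access to atom
# embeddings extracted from a trained graph model. All data here is synthetic
# and hypothetical; with real embeddings, `alcohol_embs` would hold activation
# vectors for the carbinol carbon of each alcohol and `carbonyl_embs` those of
# the carbonyl carbon in the corresponding oxidation product.
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Synthetic stand-ins: a shared shift (the putative constant "-H"-like
# direction) plus small pair-specific noise.
shared_shift = rng.normal(size=dim)
alcohol_embs = rng.normal(size=(5, dim))
carbonyl_embs = alcohol_embs + shared_shift + 0.05 * rng.normal(size=(5, dim))

# Reaction vectors: product embedding minus reactant embedding, one per pair.
deltas = carbonyl_embs - alcohol_embs

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# If the transformation is a constant direction in embedding space, every
# reaction vector should align with the mean reaction vector (cosine near 1).
mean_delta = deltas.mean(axis=0)
alignment = [cosine(d, mean_delta) for d in deltas]
print("alignment of each reaction vector with the mean:", np.round(alignment, 3))

# Analogy-style completion: shift a new alcohol embedding by the mean reaction
# vector and retrieve the nearest carbonyl embedding by cosine similarity.
query = alcohol_embs[0] + mean_delta
nearest = max(range(len(carbonyl_embs)), key=lambda i: cosine(query, carbonyl_embs[i]))
print("nearest carbonyl to (alcohol_0 + mean reaction vector): index", nearest)
```

The same per-atom embeddings could, in principle, feed a simple linear probe (e.g. ridge regression) against measured ¹H or ¹³C chemical shifts to test the transfer-learning claim, though the specifics of the authors' setup are given in the paper itself.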