Gao Xinyue, Baimacheva Natalia, Aires-de-Sousa Joao
Faculty of Sciences, Université Paris Cité, 75013 Paris, France.
Faculty of Chemistry, University of Strasbourg, 4, Blaise Pascal Str., 67081 Strasbourg, France.
Molecules. 2024 Aug 22;29(16):3969. doi: 10.3390/molecules29163969.
A variational heteroencoder based on recurrent neural networks, trained with SMILES linear notations of molecular structures, was used to derive a new type of atomic descriptor: delta latent space vectors (DLSVs), obtained as the difference between the latent space vector of the original SMILES of the whole molecule and that of the SMILES of the same molecule with the target atom replaced. Three kinds of replacement were explored: changing the atomic element, replacing the atom with a character of the model vocabulary not used in the training set, or removing the target atom from the SMILES. Unsupervised mapping of the DLSV descriptors with t-distributed stochastic neighbor embedding (t-SNE) revealed a remarkable clustering according to atomic element, hybridization, atomic type, and aromaticity. Atomic DLSV descriptors were used to train machine learning (ML) models to predict ¹⁹F NMR chemical shifts. An R of up to 0.89 and mean absolute errors as low as 5.5 ppm were obtained for an independent test set of 1046 molecules with random forests or a gradient-boosting regressor. Intermediate representations from a Transformer model yielded comparable results. Furthermore, DLSVs were applied as molecular operators in the latent space: the DLSV of a halogenation (H→F substitution) was added to the latent space vectors (LSVs) of 4135 new molecules with no fluorine atom, and the results were decoded into SMILES. This yielded 99% valid SMILES, with 75% of the SMILES incorporating fluorine and 56% of the structures incorporating fluorine with no other structural change.
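The construction of a DLSV described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_encode` is a hypothetical stand-in for the trained encoder (it only hashes the string into a reproducible vector), `LATENT_DIM` is an arbitrary size, `replace_atom` is a hypothetical helper that works only for atoms written as a single SMILES character, and the sign convention (original minus modified) is an assumption, since the abstract does not state it.

```python
import hashlib

import numpy as np

LATENT_DIM = 8  # hypothetical size; the abstract does not give the real one


def toy_encode(smiles: str) -> np.ndarray:
    """Stand-in for the trained encoder (assumption, not the paper's model).

    Hashes the SMILES string to seed a random generator, so the same
    string always maps to the same vector.
    """
    digest = hashlib.sha256(smiles.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")
    return np.random.default_rng(seed).normal(size=LATENT_DIM)


def replace_atom(smiles: str, index: int, replacement: str) -> str:
    """Replace the character at `index`; an empty replacement removes the atom.

    Only valid for atoms written as one character (e.g. "F", not "Cl").
    """
    return smiles[:index] + replacement + smiles[index + 1:]


def dlsv(smiles: str, index: int, replacement: str, encode=toy_encode) -> np.ndarray:
    """Delta latent space vector for the atom at `index`.

    Assumed convention: LSV(original SMILES) minus LSV(modified SMILES).
    """
    modified = replace_atom(smiles, index, replacement)
    return encode(smiles) - encode(modified)


# Fluoroethane, target atom = F at character position 2.
print(replace_atom("CCF", 2, "Cl"))  # element change: CCCl
print(replace_atom("CCF", 2, ""))    # atom removal: CC
descriptor = dlsv("CCF", 2, "")      # one atomic descriptor for the F atom
print(descriptor.shape)              # (8,)
```

With a real encoder in place of `toy_encode`, the resulting vectors would be the per-atom features passed to a regressor, and adding a DLSV to the latent vector of another molecule before decoding would correspond to the operator use described in the abstract.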