Laboratory of Chemoinformatics, UMR 7177 University of Strasbourg/CNRS, 4 rue B. Pascal, 67000 Strasbourg, France.
Faculty of Physics, M.V. Lomonosov Moscow State University, Leninskie Gory, Moscow 19991, Russia.
J Chem Inf Model. 2019 Mar 25;59(3):1182-1196. doi: 10.1021/acs.jcim.8b00751. Epub 2019 Mar 5.
Here we show that Generative Topographic Mapping (GTM) can be used to explore the latent space of SMILES-based autoencoders and to generate focused molecular libraries of interest. We built a sequence-to-sequence neural network with Bidirectional Long Short-Term Memory layers and trained it on SMILES strings from ChEMBL23. Very high reconstruction rates were achieved for the test set molecules (>98%), comparable to those reported in related publications. Using GTM, we visualized the autoencoder latent space on a two-dimensional topographic map. Targeted map zones can be used to generate novel molecular structures by sampling the associated latent space points and decoding them to SMILES. A sampling method based on a genetic algorithm was introduced to optimize compound properties "on the fly". The generated focused molecular libraries were shown to contain original and a priori feasible compounds which, pending actual synthesis and testing, showed encouraging behavior in independent structure-based affinity estimation procedures (pharmacophore matching, docking).
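The genetic-algorithm sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the algorithm's details, so the function names (`toy_score`, `evolve`), the population size, the mutation width, and the uniform crossover are all assumptions made for illustration. In particular, `toy_score` is a hypothetical stand-in for the real pipeline step of decoding a latent vector to SMILES and estimating a property, and the seed vectors stand in for latent points associated with a targeted map zone.

```python
import random


def toy_score(z):
    # Hypothetical stand-in for "decode z to SMILES, then estimate a property".
    # Here the score is simply higher the closer z is to an arbitrary target point.
    target = [0.5] * len(z)
    return -sum((a - b) ** 2 for a, b in zip(z, target))


def evolve(seed_points, score, generations=30, pop_size=40, sigma=0.1, rng=None):
    """Evolve latent vectors toward a higher score (illustrative sketch only).

    seed_points: latent vectors assumed to come from a targeted map zone.
    score: callable mapping a latent vector to a number (higher is better).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    population = [list(p) for p in seed_points]

    # Fill the initial population with Gaussian-perturbed copies of the seeds.
    while len(population) < pop_size:
        parent = rng.choice(seed_points)
        population.append([x + rng.gauss(0, sigma) for x in parent])

    for _ in range(generations):
        # Keep the best quarter unchanged (elitism).
        population.sort(key=score, reverse=True)
        elite = population[: pop_size // 4]

        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            # Uniform crossover: take each coordinate from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Gaussian mutation of every coordinate.
            child = [x + rng.gauss(0, sigma) for x in child]
            children.append(child)

        population = elite + children

    return max(population, key=score)


if __name__ == "__main__":
    seeds = [[0.0] * 4, [1.0] * 4, [0.2, 0.8, 0.1, 0.9]]
    best = evolve(seeds, toy_score)
    print(best, toy_score(best))
```

Because the elite survive each generation unchanged, the best score in the population never decreases. In a real setting the scoring function would also have to handle latent points that decode to invalid SMILES, which this sketch ignores.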