Gómez-Bombarelli Rafael, Wei Jennifer N, Duvenaud David, Hernández-Lobato José Miguel, Sánchez-Lengeling Benjamín, Sheberla Dennis, Aguilera-Iparraguirre Jorge, Hirzel Timothy D, Adams Ryan P, Aspuru-Guzik Alán
Kyulux North America Inc., 10 Post Office Square, Suite 800, Boston, Massachusetts 02109, United States.
Department of Chemistry and Chemical Biology, Harvard University, Cambridge, Massachusetts 02138, United States.
ACS Cent Sci. 2018 Feb 28;4(2):268-276. doi: 10.1021/acscentsci.7b00572. Epub 2018 Jan 12.
We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds. We demonstrate our method in the domain of drug-like molecules and also in a set of molecules with fewer than nine heavy atoms.
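The latent-space operations named in the abstract (perturbing a known point, interpolating between two points, and gradient-based optimization of a predicted property) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three-dimensional latent vectors, the function names, and especially `predict`, which is a toy concave quadratic standing in for a trained neural-network predictor. The encoder and decoder are omitted entirely, so the sketch starts from latent vectors that are assumed to have been encoded already, and it does not decode the results back to molecules.

```python
import random


def interpolate(z_a, z_b, steps):
    """Return `steps` points on the straight line from z_a to z_b (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        [a + (b - a) * i / (steps - 1) for a, b in zip(z_a, z_b)]
        for i in range(steps)
    ]


def perturb(z, scale, rng):
    """Add independent Gaussian noise to every latent coordinate."""
    return [x + rng.gauss(0.0, scale) for x in z]


# Hypothetical stand-in for the trained predictor: a concave quadratic
# whose maximum sits at TARGET. A real predictor would be a neural network.
TARGET = [1.0, -2.0, 0.5]


def predict(z):
    """Toy property estimate: higher is better, maximal at TARGET."""
    return -sum((x - t) ** 2 for x, t in zip(z, TARGET))


def predict_grad(z):
    """Analytic gradient of `predict` with respect to z."""
    return [-2.0 * (x - t) for x, t in zip(z, TARGET)]


def gradient_ascent(z, lr=0.1, steps=100):
    """Move z uphill on the predicted property with fixed-step gradient ascent."""
    for _ in range(steps):
        z = [x + lr * g for x, g in zip(z, predict_grad(z))]
    return z


# Assumed latent codes of two already-encoded molecules (illustrative values).
z_a = [0.0, 0.0, 0.0]
z_b = [2.0, -4.0, 1.0]

path = interpolate(z_a, z_b, 5)                  # interpolation between molecules
neighbor = perturb(z_a, 0.1, random.Random(0))   # perturbing a known structure
z_opt = gradient_ascent(z_a)                     # gradient-guided search
```

In a full pipeline of the kind the abstract describes, each vector in `path`, `neighbor`, and `z_opt` would be passed to the decoder to obtain a discrete molecular representation. Plain gradient ascent is used here only because it is the simplest gradient-based optimizer; the sketch makes no claim about which optimizer the authors used.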