College of Computer Science, Sichuan University, Chengdu 610064, China.
College of Chemistry, Sichuan University, Chengdu 610064, China.
J Chem Inf Model. 2022 Oct 24;62(20):4873-4887. doi: 10.1021/acs.jcim.2c00997. Epub 2022 Aug 23.
Motivated by the challenge that deep learning faces in the low-data regime and by the urgent demand for intelligent design of highly energetic materials, we explore a correlated deep learning framework, consisting of three recurrent neural networks (RNNs) linked by a transfer learning strategy, to efficiently generate new energetic molecules with a high detonation velocity when very limited data are available. To avoid dependence on an external big data set, data augmentation by fragment shuffling of 303 energetic compounds is used to produce 500,000 molecules for pretraining an RNN, through which the model can learn sufficient structural knowledge. The pretrained RNN is then fine-tuned on the 303 energetic compounds to generate 7153 molecules similar to them. To screen more reliably for molecules with a high detonation velocity, SMILES enumeration augmentation coupled with the pretrained knowledge is used to build an RNN-based prediction model, which boosts the prediction accuracy from 0.4446 to 0.9572. Performance comparable to that of a transfer learning strategy based on an existing big database (ChEMBL), in producing both energetic molecules and drug-like ones, further supports the effectiveness and generality of our strategy in the low-data regime. High-precision quantum mechanics calculations further confirm that 35 new molecules present a higher detonation velocity and lower synthetic accessibility than the classic explosive RDX, along with good thermal stability. In particular, three new molecules are comparable to the caged CL-20 in detonation velocity. All the source code and the data set are freely available at https://github.com/wangchenghuidream/RNNMGM.
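As a rough illustration of how shuffling fragments can expand a small seed set into a much larger pretraining corpus, the following is a minimal sketch and not the authors' procedure. It assumes each molecule has already been split into an ordered list of fragment labels; the function name `augment_by_fragment_shuffling`, its parameters, and the fragment labels are all hypothetical. Fragments are treated as opaque strings, so no chemical validity check is performed; a real pipeline would need a cheminformatics toolkit to reassemble fragments into valid structures.

```python
import random
from typing import List, Sequence, Set, Tuple


def augment_by_fragment_shuffling(
    molecules: Sequence[Sequence[str]],
    n_target: int,
    seed: int = 0,
    max_attempts: int = 100_000,
) -> List[Tuple[str, ...]]:
    """Recombine fragments from a small seed set into new fragment sequences.

    Toy illustration only (hypothetical helper): fragments are opaque
    labels and the result is not checked for chemical validity.
    """
    rng = random.Random(seed)
    # Pool every distinct fragment seen anywhere in the seed set.
    pool = sorted({frag for mol in molecules for frag in mol})
    # Reuse the seed set's fragment counts so sizes stay comparable.
    lengths = [len(mol) for mol in molecules]
    # Track what already exists so only novel sequences are kept.
    seen: Set[Tuple[str, ...]] = {tuple(mol) for mol in molecules}
    generated: List[Tuple[str, ...]] = []
    attempts = 0
    while len(generated) < n_target and attempts < max_attempts:
        attempts += 1
        k = rng.choice(lengths)
        candidate = tuple(rng.choice(pool) for _ in range(k))
        if candidate not in seen:
            seen.add(candidate)
            generated.append(candidate)
    return generated


# Hypothetical seed set: three "molecules" written as fragment labels.
seed_set = [
    ("ring_A", "NO2", "NO2"),
    ("ring_B", "NH2", "NO2"),
    ("ring_A", "N3"),
]
augmented = augment_by_fragment_shuffling(seed_set, n_target=20)
print(len(augmented), "new fragment sequences from", len(seed_set), "seeds")
```

The fixed random seed makes the augmentation reproducible, and the `max_attempts` cap keeps the loop from running forever when the fragment pool is too small to supply the requested number of novel sequences.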