Karpov Pavel, Godin Guillaume, Tetko Igor V
Institute of Structural Biology, Helmholtz Zentrum München-Research Center for Environmental Health (GmbH), Ingolstädter Landstraße 1, 85764, Neuherberg, Germany.
BIGCHEM GmbH, Ingolstädter Landstraße 1, 85764, Neuherberg, Germany.
J Cheminform. 2020 Mar 18;12(1):17. doi: 10.1186/s13321-020-00423-w.
We present SMILES embeddings derived from the internal encoder state of a Transformer [1] model trained to canonicalize SMILES as a Seq2Seq problem. Applying a CharNN [2] architecture on top of these embeddings yields higher-quality interpretable QSAR/QSPR models on diverse benchmark datasets, including both regression and classification tasks. The proposed Transformer-CNN method uses SMILES augmentation for training and inference, so each prediction is based on an internal consensus. Because both the augmentation and the transfer learning build on the embeddings, the method also provides good results for small datasets. We discuss the reasons for this effectiveness and outline future directions for the development of the method. The source code and the embeddings needed to train a QSAR model are available at https://github.com/bigchem/transformer-cnn. The repository also contains a standalone program for QSAR prediction that calculates individual atom contributions, thus interpreting the model's results. The OCHEM [3] environment (https://ochem.eu) hosts an online implementation of the proposed method.
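
To make the architecture concrete, below is a minimal Keras sketch of a character-level CNN head operating on per-token encoder embeddings, in the spirit of the CharNN design described above. The sequence length, embedding width, kernel sizes, and filter counts are illustrative assumptions, not the paper's exact settings.

# Sketch of a TextCNN-style head over Transformer encoder embeddings.
# All hyperparameters here are assumptions chosen for illustration.
import tensorflow as tf

seq_len, emb_dim = 128, 64                      # assumed embedding shape
inputs = tf.keras.Input(shape=(seq_len, emb_dim))

# Parallel 1-D convolutions with different kernel widths scan the
# character embeddings; global max-pooling keeps the strongest response.
pooled = []
for k in (1, 2, 3, 4, 5, 6):
    conv = tf.keras.layers.Conv1D(100, k, activation="relu")(inputs)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(conv))

x = tf.keras.layers.Concatenate()(pooled)
x = tf.keras.layers.Dense(64, activation="relu")(x)
output = tf.keras.layers.Dense(1)(x)            # regression head; a sigmoid
                                                # would serve classification
model = tf.keras.Model(inputs, output)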
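
The augmentation-and-consensus step can likewise be sketched with RDKit, whose MolToSmiles(..., doRandom=True) enumerates random (non-canonical) SMILES for a molecule. Here model.predict is a hypothetical stand-in for any trained Transformer-CNN predictor, not an API from the repository.

# Sketch of SMILES augmentation with consensus (averaged) prediction.
from rdkit import Chem
import numpy as np

def augment_smiles(smiles, n=10):
    """Return up to n distinct random SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Cannot parse SMILES: {smiles}")
    return sorted({Chem.MolToSmiles(mol, doRandom=True) for _ in range(n)})

def consensus_predict(model, smiles, n=10):
    """Average the model's predictions over augmented SMILES variants."""
    preds = [model.predict(s) for s in augment_smiles(smiles, n)]
    return float(np.mean(preds))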