SMICLR: Contrastive Learning on Multiple Molecular Representations for Semisupervised and Unsupervised Representation Learning.

Affiliations

Institute of Science and Technology, Federal University of São Paulo (Unifesp), 12247-014, São José dos Campos, SP, Brazil.

São Carlos Institute of Chemistry, University of São Paulo, P.O. Box 780, 13560-970, São Carlos, SP, Brazil.

Publication Information

J Chem Inf Model. 2022 Sep 12;62(17):3948-3960. doi: 10.1021/acs.jcim.2c00521. Epub 2022 Aug 31.

Abstract

Machine learning as a tool for chemical space exploration broadens horizons to work with known and unknown molecules. At its core lies molecular representation, an essential key to improving learning about structure-property relationships. Recently, contrastive frameworks have been showing impressive results for representation learning in diverse domains. Therefore, this paper proposes a contrastive framework that embraces multimodal molecular data. Specifically, our approach jointly trains a graph encoder and an encoder for the simplified molecular-input line-entry system (SMILES) string to perform the contrastive learning objective. Since SMILES is the basis of our method, i.e., we build the molecular graph from the SMILES string, we call our framework SMILES Contrastive Learning (SMICLR). When stacking a nonlinear regressor on SMICLR's pretrained encoder and fine-tuning the entire model, we reduced the prediction error by, on average, 44% and 25% for the energetic and electronic properties of the QM9 data set, respectively, over the supervised baseline. We further improved our framework's performance when applying data augmentations to each molecular-input representation. Moreover, SMICLR demonstrated competitive representation learning results in an unsupervised setting.
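The abstract describes a SimCLR-style setup in which a graph encoder and a SMILES encoder are trained so that the two embeddings of the same molecule agree while embeddings of different molecules are pushed apart. A minimal NumPy sketch of such a cross-view NT-Xent contrastive objective is shown below; this is an illustrative assumption based on the SimCLR family the name SMICLR alludes to, not the authors' exact implementation, and the encoders are stubbed out as precomputed embedding matrices.

```python
import numpy as np

def nt_xent_loss(z_graph, z_smiles, temperature=0.1):
    """Cross-view NT-Xent loss: for molecule i, (z_graph[i], z_smiles[i])
    is the positive pair and every other molecule in the batch acts as a
    negative. Hypothetical sketch of the contrastive objective."""
    # L2-normalize so dot products become cosine similarities.
    z_graph = z_graph / np.linalg.norm(z_graph, axis=1, keepdims=True)
    z_smiles = z_smiles / np.linalg.norm(z_smiles, axis=1, keepdims=True)
    # Similarity of every graph embedding to every SMILES embedding.
    logits = z_graph @ z_smiles.T / temperature
    n = logits.shape[0]
    # Cross-entropy with the matching molecule (the diagonal) as the target.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), np.arange(n)].mean()

# Toy batch: 4 molecules, 8-dimensional embeddings from each encoder.
rng = np.random.default_rng(0)
zg = rng.normal(size=(4, 8))
loss_random = nt_xent_loss(zg, rng.normal(size=(4, 8)))   # unrelated views
loss_aligned = nt_xent_loss(zg, zg)                        # identical views
print(loss_random, loss_aligned)
```

With identical views the positives dominate the similarity matrix and the loss drops toward zero, whereas unrelated views give a loss near log(batch size); minimizing this objective is what pulls the graph and SMILES representations of the same molecule together.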

