Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.
Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.
J Chem Inf Model. 2022 Jun 13;62(11):2713-2725. doi: 10.1021/acs.jcim.2c00495. Epub 2022 May 31.
Deep learning has become prevalent in computational chemistry and is widely applied to molecular property prediction. Recently, self-supervised learning (SSL), especially contrastive learning (CL), has attracted growing attention for its potential to learn molecular representations that generalize across the gigantic chemical space. Unlike supervised learning, SSL can directly leverage large unlabeled datasets, which greatly reduces the effort required to acquire molecular property labels through costly and time-consuming simulations or experiments. However, most molecular SSL methods borrow insights from the machine learning community while neglecting the unique cheminformatics of molecules (e.g., molecular fingerprints) and their multilevel graphical structures (e.g., functional groups). In this work, we propose iMolCLR, an improvement of Molecular Contrastive Learning of Representations with graph neural networks (GNNs), in two respects: (1) mitigating faulty negative contrastive instances by considering cheminformatics similarities between molecule pairs and (2) fragment-level contrasting between intra- and intermolecular substructures decomposed from molecules. Experiments show that the proposed strategies significantly improve the performance of GNN models on various challenging molecular property predictions. Compared with the previous CL framework, iMolCLR demonstrates an average 1.2% improvement in ROC-AUC on eight classification benchmarks and an average 10.1% decrease in error on six regression benchmarks. On most benchmarks, the generic GNN pretrained by iMolCLR rivals or even surpasses supervised learning models with sophisticated architectures and engineered features. Further investigations demonstrate that the representations learned through iMolCLR intrinsically embed scaffolds and functional groups that can be used to reason about molecular similarity.
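The first strategy, mitigating faulty negatives via cheminformatics similarity, can be illustrated with a minimal sketch: in an NT-Xent-style contrastive loss, each negative pair is down-weighted by the Tanimoto similarity of the two molecules' fingerprints, so that chemically similar molecules are not pushed apart as hard. This is an illustrative toy implementation, not the authors' exact loss; the function names, the bit-set fingerprint representation, and the `(2k, 2k+1)` positive-pair layout are all assumptions made for the example.

```python
import math

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as bit-index sets."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def weighted_contrastive_loss(sim, fps, temperature=0.1):
    """NT-Xent-style loss over a batch of 2N augmented views.

    sim[i][j] is the (cosine) similarity of learned representations;
    views (2k, 2k+1) are assumed to be the positive pair for molecule k.
    Each negative term is scaled by (1 - Tanimoto similarity), so a
    chemically near-identical negative contributes almost nothing.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = i + 1 if i % 2 == 0 else i - 1  # the other view of molecule i
        numerator = math.exp(sim[i][pos] / temperature)
        denominator = 0.0
        for j in range(n):
            if j == i:
                continue
            # positives keep full weight; negatives are softened
            weight = 1.0 if j == pos else (1.0 - tanimoto(fps[i], fps[j]))
            denominator += weight * math.exp(sim[i][j] / temperature)
        loss += -math.log(numerator / denominator)
    return loss / n
```

With this weighting, a "negative" that is actually the same scaffold (Tanimoto near 1) is almost entirely discounted, which is the intuition behind reducing faulty negatives.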