Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, 200438 Shanghai, China.
Department of Computer Science and Technology, Tongji University, 201804 Shanghai, China.
Brief Bioinform. 2023 Sep 20;24(5). doi: 10.1093/bib/bbad296.
Molecular property prediction (MPP) is a fundamental task in AI-aided drug discovery (AIDD). Recent studies have shown that self-supervised learning (SSL) is a promising way to produce molecular representations that cope with the data scarcity problem common in AIDD. Because specific substructures of a molecule play important roles in determining its properties, molecular representations learned by deep learning models are expected to give more weight to such substructures, implicitly or explicitly, to achieve better predictive performance. However, few SSL pre-trained models for MPP in the literature have focused on such substructures. To address this gap, this paper presents Chemistry-Aware Fragmentation for Effective MPP (CAFE-MPP) under the self-supervised contrastive learning framework. First, a novel fragment-based molecular graph (FMG) is designed to represent the topological relationships between the chemistry-aware substructures that constitute a molecule. Then, with well-designed hard negative pairs, an encoder is pre-trained at the fragment level by contrastive learning to extract representations for the nodes in FMGs. Finally, a Graphormer model is used to produce molecular representations for MPP from the fragment embeddings. Experiments on 11 benchmark datasets show that CAFE-MPP achieves state-of-the-art performance on 7 of the 11 datasets and the second-best performance on 3 datasets, compared with six competitive self-supervised methods. Further investigations also show that CAFE-MPP learns molecular representations that implicitly contain information about the fragments most strongly correlated with molecular properties, and that it can alleviate the over-smoothing problem of graph neural networks.
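The first two steps of the pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the fragmentation rule, the encoder, or the exact loss. The sketch therefore assumes that an atom-to-fragment assignment is already given, and uses the standard InfoNCE loss as a stand-in contrastive objective. The function names `build_fragment_graph`, `cosine`, and `info_nce` and the temperature value are hypothetical.

```python
import math
from collections import defaultdict


def build_fragment_graph(bonds, atom_to_fragment):
    """Collapse an atom-level graph into a fragment-level graph.

    bonds: iterable of (atom_i, atom_j) index pairs.
    atom_to_fragment: dict mapping each atom index to a fragment id
        (assumed input; the chemistry-aware fragmentation itself is not shown).
    Returns a dict: fragment id -> sorted list of neighbouring fragment ids.
    """
    neighbours = defaultdict(set)
    for frag in atom_to_fragment.values():
        neighbours[frag]  # touch the key so isolated fragments appear as nodes
    for i, j in bonds:
        fi, fj = atom_to_fragment[i], atom_to_fragment[j]
        if fi != fj:  # a bond that crosses fragments becomes a fragment-level edge
            neighbours[fi].add(fj)
            neighbours[fj].add(fi)
    return {frag: sorted(adj) for frag, adj in neighbours.items()}


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor, one positive, and a list of negatives."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy chain of four atoms split into three fragments A, B, C.
fmg = build_fragment_graph(
    bonds=[(0, 1), (1, 2), (2, 3)],
    atom_to_fragment={0: "A", 1: "A", 2: "B", 3: "C"},
)

# A negative close to the anchor (hard) versus one far from it (easy).
anchor, positive = [1.0, 0.0], [1.0, 0.1]
loss_hard = info_nce(anchor, positive, negatives=[[1.0, 0.3]])
loss_easy = info_nce(anchor, positive, negatives=[[-1.0, 0.0]])
```

In this toy setting the hard negative yields a larger loss than the easy one, which is the usual motivation for hard negative pairs: they keep the contrastive objective informative once easy negatives are already well separated.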