School of Computer Science and Technology, East China Normal University, 200062 Shanghai, China.
Innovation Center for AI and Drug Discovery, East China Normal University, 200062 Shanghai, China.
Brief Bioinform. 2024 Sep 23;25(6). doi: 10.1093/bib/bbae565.
Substructure-based representation learning has emerged as a powerful approach to featurizing complex attributed graphs, with promising results in molecular property prediction (MPP). However, existing MPP methods mainly rely on manually defined rules to extract substructures. Adaptively identifying meaningful substructures across large collections of molecular graphs, in a way that suits MPP tasks, remains an open challenge. To this end, this paper proposes Prototype-based cOntrastive Substructure IdentificaTion (POSIT), a self-supervised framework that autonomously discovers substructural prototypes across graphs and uses them to guide end-to-end molecular fragmentation. During pre-training, POSIT emphasizes two key aspects of substructure identification. First, it imposes a soft connectivity constraint to encourage the generation of topologically meaningful substructures. Second, it aligns the resulting substructures with the derived prototypes through a prototype-substructure contrastive clustering objective, ensuring attribute-based similarity within clusters. In the fine-tuning stage, a cross-scale attention mechanism integrates substructure-level information to enhance molecular representations. Experimental results on diverse real-world datasets, covering both classification and regression tasks, demonstrate the effectiveness of the POSIT framework. Moreover, visualization analysis shows that the identified substructures are consistent with chemical priors. The source code is publicly available at https://github.com/VRPharmer/POSIT.
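To make the two pre-training ideas in the abstract more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes an InfoNCE-style loss for the prototype-substructure contrastive objective and a normalized-cut-style term (as used in MinCut-type graph pooling) as a stand-in for the soft connectivity constraint. The function names, the temperature value, and both loss forms are illustrative assumptions; the actual formulation is in the paper and the linked repository.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prototype_contrastive_loss(sub_emb, prototypes, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    sub_emb:    (n, d) substructure embeddings
    prototypes: (k, d) prototype embeddings
    Each substructure is pulled toward its most similar prototype and
    pushed away from the remaining prototypes.
    """
    z = l2_normalize(np.asarray(sub_emb, dtype=float))
    c = l2_normalize(np.asarray(prototypes, dtype=float))
    logits = z @ c.T / temperature                 # (n, k) scaled cosine similarity
    assign = logits.argmax(axis=1)                 # hard assignment to nearest prototype
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_prob[np.arange(len(assign)), assign].mean()
    return loss, assign


def connectivity_penalty(adj, soft_assign):
    """Normalized-cut-style term (assumed stand-in for a soft connectivity constraint).

    adj:         (n, n) symmetric atom adjacency matrix
    soft_assign: (n, k) atom-to-substructure assignment weights
    More negative values mean that more edges fall inside substructures.
    """
    adj = np.asarray(adj, dtype=float)
    s = np.asarray(soft_assign, dtype=float)
    deg = np.diag(adj.sum(axis=1))
    within = np.trace(s.T @ adj @ s)               # edge weight kept inside clusters
    volume = np.trace(s.T @ deg @ s)               # total degree covered by clusters
    return -within / (volume + 1e-12)
```

In this sketch, embeddings that sit close to a single prototype yield a lower contrastive loss than embeddings that are equidistant from all prototypes, and an assignment that groups bonded atoms together yields a lower connectivity penalty than one that scatters them. A combined pre-training objective would presumably weight the two terms against each other, but that weighting is not specified in the abstract.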