MPCD: A Multitask Graph Transformer for Molecular Property Prediction by Integrating Common and Domain Knowledge.

Author information

Yang Xixi, Duan Yanjing, Cheng Zhixiang, Li Kun, Liu Yuansheng, Zeng Xiangxiang, Cao Dongsheng

Affiliations

College of Computer Science and Electronic Engineering, Hunan University, Changsha 410086, Hunan, China.

Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, China.

Publication information

J Med Chem. 2024 Dec 12;67(23):21303-21316. doi: 10.1021/acs.jmedchem.4c02193. Epub 2024 Dec 2.

DOI:10.1021/acs.jmedchem.4c02193

PMID:39620982

Abstract

Molecular property prediction with deep learning often employs self-supervised learning techniques to learn common knowledge through masked atom prediction. However, the common knowledge gained by masked atom prediction differs dramatically from the graph-level optimization objective of downstream tasks, which leads to suboptimal solutions. Particularly for properties with limited data, the failure to consider domain knowledge results in a direct search in an immense common space, making it infeasible to identify the global optimum. To address this, we propose MPCD, which enhances pretraining transferability by aligning the optimization objectives between pretraining and fine-tuning with domain knowledge. MPCD also leverages multitask learning to improve data utilization and model robustness. Technically, MPCD employs a relation-aware self-attention mechanism to comprehensively capture the local and global structures of molecules. Extensive validation demonstrates that MPCD outperforms state-of-the-art methods for absorption, distribution, metabolism, excretion, and toxicity (ADMET) and physicochemical property prediction across various data sizes.

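
The abstract names a relation-aware self-attention mechanism but does not give its formulation. As a rough illustration of the general idea only, the sketch below adds a scalar bias to each attention logit according to the pairwise relation between two atoms. Everything here is an assumption made for illustration, not the MPCD implementation: the function names, the three relation types (0 = same atom, 1 = bonded, 2 = not bonded), and the choice of a scalar bias rather than learned relation embeddings.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation_aware_attention(q, k, v, relation, relation_bias):
    """Single-head attention with a per-relation scalar bias (illustrative only).

    q, k, v        : lists of per-atom vectors (lists of floats)
    relation[i][j] : integer id of the relation between atoms i and j
                     (hypothetical ids: 0 = same atom, 1 = bonded, 2 = not bonded)
    relation_bias  : list mapping relation id -> scalar added to the logit
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scaled dot-product logit plus the bias for the (i, j) relation.
        logits = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            + relation_bias[relation[i][j]]
            for j in range(len(k))
        ]
        weights = softmax(logits)
        # Weighted sum of value vectors.
        out.append(
            [sum(w * v[j][c] for j, w in enumerate(weights))
             for c in range(len(v[0]))]
        )
    return out


# Toy 3-atom chain: atom 0 - atom 1 - atom 2.
relation = [[0, 1, 2],
            [1, 0, 1],
            [2, 1, 0]]
q = k = [[0.0, 0.0] for _ in range(3)]   # zero queries/keys: only the bias matters
v = [[1.0], [2.0], [3.0]]

# A strongly negative bias on "not bonded" pairs restricts attention to
# local structure; a zero bias lets every atom attend globally.
local_out = relation_aware_attention(q, k, v, relation, [0.0, 0.0, -1e9])
global_out = relation_aware_attention(q, k, v, relation, [0.0, 0.0, 0.0])
```

In this toy setup the bias alone decides whether an atom sees only its bonded neighbors or the whole molecule, which is one simple way a single attention layer could mix local and global structural information. In a trained model the biases would be learned parameters rather than fixed values.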