State Key Laboratory of High-Performance Computing, School of Computer Science, National University of Defense Technology, China.
Xiangya School of Pharmaceutical Sciences, Central South University, China.
Brief Bioinform. 2021 Nov 5;22(6). doi: 10.1093/bib/bbab152.
Accurate and efficient prediction of molecular properties is one of the fundamental issues in drug design and discovery pipelines. Traditional feature engineering-based approaches require extensive expertise in the feature design and selection process. With the development of artificial intelligence (AI) technologies, data-driven methods exhibit unparalleled advantages over the feature engineering-based methods in various domains. Nevertheless, when applied to molecular property prediction, AI models usually suffer from the scarcity of labeled data and show poor generalization ability.
In this study, we propose molecular graph BERT (MG-BERT), which integrates the local message passing mechanism of graph neural networks (GNNs) into the BERT model to facilitate learning from molecular graphs. We also propose a self-supervised learning strategy, masked atoms prediction, to pretrain the MG-BERT model on a large amount of unlabeled data and mine context information in molecules. We find that, after pretraining, the MG-BERT model generates context-sensitive atomic representations and transfers the learned knowledge to the prediction of a variety of molecular properties. The experimental results show that the pretrained MG-BERT model, with a small amount of additional fine-tuning, consistently outperforms the state-of-the-art methods on all 11 ADMET datasets. Moreover, the MG-BERT model uses attention mechanisms to focus on the atomic features essential to the target property, which makes the trained model interpretable. Because it requires no hand-crafted features as input and its predictions can be inspected through attention, the MG-BERT model offers a more reliable, novel framework for developing state-of-the-art models for a wide range of drug discovery tasks.
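The two ideas named above, masked atoms prediction and attention restricted by local message passing, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `mask_atoms` and `local_attention_mask`, the 15% masking rate (borrowed from common BERT practice), the choice to always substitute a `[MASK]` token, and the use of the bond list to limit which atoms may attend to each other are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, named after the BERT convention


def mask_atoms(atoms, mask_rate=0.15, rng=None):
    """Hide a random subset of atoms and return what the model must recover.

    Assumption: mask_rate=0.15 follows common BERT practice; the abstract
    does not state the rate. At least one atom is masked when any exist.
    """
    if not atoms:
        return [], {}
    rng = rng or random.Random()
    n_mask = max(1, round(len(atoms) * mask_rate))
    chosen = rng.sample(range(len(atoms)), n_mask)
    masked = list(atoms)
    targets = {}
    for i in chosen:
        targets[i] = atoms[i]  # label the model is trained to predict
        masked[i] = MASK       # input token the model actually sees
    return masked, targets


def local_attention_mask(num_atoms, bonds):
    """Build a boolean matrix: allowed[i][j] is True if atom i may attend to j.

    Assumption: attention is limited to the atom itself and its bonded
    neighbours, one simple way to mimic local message passing.
    """
    allowed = [[i == j for j in range(num_atoms)] for i in range(num_atoms)]
    for a, b in bonds:
        allowed[a][b] = True
        allowed[b][a] = True
    return allowed


# Ethanol heavy atoms (C-C-O) as a toy molecular graph.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

masked, targets = mask_atoms(atoms, rng=random.Random(0))
allowed = local_attention_mask(len(atoms), bonds)
```

In a pretraining loop built on this sketch, the model would receive `masked` together with `allowed` and be scored on how well it recovers `targets`. Because each atom only sees its bonded neighbours per layer, information about more distant atoms would have to propagate through stacked layers.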