Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan 410013, P. R. China.
College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410013, P. R. China.
J Med Chem. 2024 Jun 13;67(11):9575-9586. doi: 10.1021/acs.jmedchem.4c00692. Epub 2024 May 15.
Precisely predicting molecular properties is crucial in drug discovery, but the scarcity of labeled data poses a challenge for applying deep learning methods. While large-scale self-supervised pretraining has proven an effective solution, it often neglects domain-specific knowledge. To tackle this issue, we introduce Task-Oriented Multilevel Learning based on BERT (TOML-BERT), a dual-level pretraining framework that considers both the structural patterns and the domain knowledge of molecules. TOML-BERT achieved state-of-the-art prediction performance on 10 pharmaceutical datasets. It can mine contextual information within molecular structures and extract domain knowledge from massive pseudo-labeled data. The dual-level pretraining achieved significant positive transfer, with its two components making complementary contributions. Interpretive analysis showed that the effectiveness of the dual-level pretraining lies in learning a task-related molecular representation in advance. Overall, TOML-BERT demonstrates the potential of combining multiple pretraining tasks to extract task-oriented knowledge, advancing molecular property prediction in drug discovery.
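For readers wanting a concrete picture of what a dual-level pretraining objective can look like, the sketch below pairs a structure-level masked-token loss with a knowledge-level regression loss on pseudo labels, sharing one encoder. It is a minimal illustration under assumed settings (the `DualLevelEncoder` class, vocabulary size, mask id, number of pseudo-label tasks, and the loss weighting `alpha` are all hypothetical), not the authors' TOML-BERT implementation.

```python
# Minimal sketch of a dual-level pretraining objective over SMILES tokens.
# All sizes, token ids, and module names are illustrative assumptions,
# not the TOML-BERT architecture described in the paper.
import torch
import torch.nn as nn

VOCAB_SIZE = 128       # hypothetical SMILES-token vocabulary size
N_PSEUDO_TASKS = 3     # hypothetical number of pseudo-labeled property tasks


class DualLevelEncoder(nn.Module):
    """Shared encoder with two pretraining heads:
    (1) masked-token prediction capturing structural patterns,
    (2) regression on pseudo labels carrying domain knowledge."""

    def __init__(self, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, VOCAB_SIZE)        # structure level
        self.pseudo_head = nn.Linear(d_model, N_PSEUDO_TASKS)  # knowledge level

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))        # (batch, length, d_model)
        token_logits = self.mlm_head(h)             # per-token predictions
        pooled = h.mean(dim=1)                      # simple mean pooling
        pseudo_pred = self.pseudo_head(pooled)      # molecule-level predictions
        return token_logits, pseudo_pred


def dual_level_loss(model, tokens, masked_tokens, mask, pseudo_labels, alpha=1.0):
    """Combine the two pretraining losses; alpha weights the knowledge level."""
    token_logits, pseudo_pred = model(masked_tokens)
    mlm_loss = nn.functional.cross_entropy(
        token_logits[mask], tokens[mask])           # recover the masked tokens
    pseudo_loss = nn.functional.mse_loss(pseudo_pred, pseudo_labels)
    return mlm_loss + alpha * pseudo_loss
```

In such a setup, the masked-token term drives the encoder to learn contextual structural information, while the pseudo-label term injects task-related domain knowledge, so the downstream fine-tuning starts from a representation already oriented toward the target property.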