Recursion, Salt Lake City, UT, USA.
Novo Nordisk Research Center, Lexington, MA, USA.
Nat Commun. 2024 Nov 12;15(1):9431. doi: 10.1038/s41467-024-53751-y.
Models that accurately predict properties based on chemical structure are valuable tools in the chemical sciences. However, for many properties, public and private training sets are typically small, making it difficult for models to generalize well outside of the training data. Recently, this lack of generalization has been mitigated by using self-supervised pretraining on large unlabeled datasets, followed by finetuning on smaller, labeled datasets. Inspired by these advances, we report MolE, a Transformer architecture adapted for molecular graphs together with a two-step pretraining strategy. The first step of pretraining is a self-supervised approach focused on learning chemical structures trained on ~842 million molecular graphs, and the second step is a massive multi-task approach to learn biological information. We show that finetuning models that were pretrained in this way perform better than the best published results on 10 of the 22 ADMET (absorption, distribution, metabolism, excretion and toxicity) tasks included in the Therapeutics Data Commons leaderboard (as of September 2023).
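The staged training recipe the abstract describes — self-supervised structural pretraining on unlabeled graphs, then multi-task pretraining on biological labels, then finetuning on a small labeled task — can be sketched as control flow. Everything below is hypothetical scaffolding: `ToyEncoder`, the scalar "gradients", and the placeholder data stand in for the actual MolE Transformer, objectives, and datasets described in the paper.

```python
class ToyEncoder:
    """Stand-in for a Transformer encoder over molecular graphs.

    The real model has millions of parameters; here a single scalar
    weight is enough to show how the three training stages chain
    together, each one updating the same shared encoder.
    """

    def __init__(self):
        self.w = 0.0  # shared parameter carried through all stages

    def step(self, grad, lr=0.1):
        # One gradient-descent update on the shared weight.
        self.w -= lr * grad


def self_supervised_pretrain(encoder, graphs):
    # Stage 1: learn chemical structure from unlabeled graphs
    # (in the paper, ~842M molecular graphs; here, toy scalars).
    for g in graphs:
        encoder.step(encoder.w - g)  # toy reconstruction gradient


def multitask_pretrain(encoder, tasks):
    # Stage 2: massive multi-task pretraining on biological labels,
    # all tasks updating the same shared encoder.
    for labels in tasks.values():
        for y in labels:
            encoder.step(encoder.w - y)


def finetune(encoder, labels):
    # Final stage: adapt the pretrained encoder to one small
    # labeled dataset (e.g. a single ADMET task).
    for y in labels:
        encoder.step(encoder.w - y)
    return encoder.w


encoder = ToyEncoder()
self_supervised_pretrain(encoder, graphs=[1.0, 1.0, 1.0])
multitask_pretrain(encoder, tasks={"taskA": [0.5], "taskB": [0.7]})
w = finetune(encoder, labels=[0.9])
```

The key design point the sketch preserves is that every stage updates the same encoder, so structural knowledge from stage 1 and biological knowledge from stage 2 are both available when finetuning on a small task-specific dataset.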