Wen Mingjian, Blau Samuel M, Xie Xiaowei, Dwaraknath Shyam, Persson Kristin A
Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
College of Chemistry, University of California, Berkeley, CA 94720, USA.
Chem Sci. 2022 Jan 11;13(5):1446-1458. doi: 10.1039/d1sc06515g. eCollection 2022 Feb 2.
Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on a large amount of labelled data to avoid overfitting, which leads to low accuracy and poor transferability. In this work, we propose a strategy that leverages unlabelled data to learn accurate ML models from small labelled chemical reaction datasets. We focus on an old and prominent problem, classifying reactions into distinct families, and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from those of other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center and find that they are key to the model extracting relevant information from unlabelled data to aid the reaction classification task. The transfer learned model outperforms a supervised model trained from scratch by a large margin. Further, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets, as well as models based on reaction fingerprints derived from masked language modelling. In addition to reaction classification, the effectiveness of the strategy is tested on regression datasets; the learned GNN-based reaction fingerprints can also be used to navigate the chemical reaction space, which we demonstrate by querying for similar reactions.
The strategy can be readily applied to other predictive reaction problems to uncover the power of unlabelled data for learning better models with a limited supply of labels.
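The core of the pretraining step described above is a contrastive objective that pulls the embeddings of two augmented views of the same reaction together while pushing them away from all other reactions in the batch. The following is a minimal NumPy sketch of an NT-Xent (SimCLR-style) loss that illustrates this general technique; it is not the authors' implementation, and the function name, batch shapes, and temperature value are assumptions for illustration.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss for a batch of paired embeddings.

    z1, z2: (batch, dim) arrays holding embeddings of the two augmented
    views of each reaction. Row i of z1 and row i of z2 form a positive
    pair; every other embedding in the batch serves as a negative.
    """
    batch = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)               # (2B, dim)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize -> cosine sim
    sim = z @ z.T / temperature                        # (2B, 2B) similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    # index of the positive partner for each row: i <-> i + batch
    pos = np.concatenate([np.arange(batch, 2 * batch), np.arange(batch)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * batch), pos].mean()

# Identical views score a lower loss than unrelated random views,
# which is the signal the pretraining optimizes.
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
z2 = rng.normal(size=(4, 8))
print(nt_xent_loss(z1, z1), nt_xent_loss(z1, z2))
```

In the paper's setting, `z1` and `z2` would come from the GNN encoder applied to two chemically consistent augmentations of the same reaction (augmentations that preserve the reaction center), so that minimizing this loss teaches the encoder reaction-level features without any labels.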