Viet Johansson Simon, Gummesson Svensson Hampus, Bjerrum Esben, Schliep Alexander, Haghir Chehreghani Morteza, Tyrchan Christian, Engkvist Ola
Molecular AI, Discovery Sciences, R&D, AstraZeneca, SE-431 83, Mölndal, Sweden.
Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg, SE-412 96, Göteborg, Sweden.
Mol Inform. 2022 Dec;41(12):e2200043. doi: 10.1002/minf.202200043. Epub 2022 Jul 14.
Computer-aided synthesis planning, which suggests synthetic routes for molecules of interest, is a rapidly growing field. The machine learning methods it relies on often depend on large training datasets, but finite experimental budgets limit how much data can be collected. This motivates data collection schemes such as active learning, which selects the data points expected to have the highest impact on model accuracy and has been used successfully in recent studies. However, little has been done to explore how robust reaction yield prediction methods are when they are combined with active learning to reduce the amount of experimental data needed for training. This study investigates the influence of the machine learning algorithm and the number of initial data points on reaction yield prediction for two public high-throughput experimentation datasets. Our results show that active learning based on output margin reached a pre-defined AUROC faster than random sampling on both datasets. Analysis of the feature importance of the trained machine learning models suggests that active learning had a larger influence on model accuracy when only a few features were important for the model prediction.
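The selection step behind active learning based on output margin can be illustrated with a minimal sketch. The abstract does not define the margin or the models used, so the following is an assumption rather than the authors' implementation: it treats yield prediction as binary classification (consistent with the use of AUROC), takes the margin to be the distance of the predicted positive-class probability from 0.5, and queries the candidates with the smallest margin. The function names and the example probabilities are hypothetical.

```python
import random
from typing import List, Sequence


def output_margin(prob_positive: float) -> float:
    """Distance of a predicted positive-class probability from the 0.5 boundary.

    Assumed definition; the source abstract does not spell out the margin.
    """
    return abs(prob_positive - 0.5)


def select_by_margin(pool_probs: Sequence[float], batch_size: int) -> List[int]:
    """Return indices of the pool candidates with the smallest margin.

    A small margin means the model is least certain about that candidate,
    so labeling it is expected to be most informative.
    """
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: output_margin(pool_probs[i]))
    return ranked[:batch_size]


def select_at_random(pool_size: int, batch_size: int, seed: int = 0) -> List[int]:
    """Baseline: pick pool candidates uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(range(pool_size), min(batch_size, pool_size))


# Hypothetical predicted probabilities of "high yield" for five unlabeled reactions.
pool_probs = [0.95, 0.52, 0.10, 0.45, 0.70]
chosen = select_by_margin(pool_probs, batch_size=2)
print(chosen)  # [1, 3]: the two predictions closest to 0.5
```

In a full loop, the chosen reactions would be run experimentally, added to the training set, and the model retrained before the next round of selection; the random baseline is the comparison the abstract refers to.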