Allchemy, Inc., Highland, Indiana 46322, United States.
Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw 01-224, Poland.
J Am Chem Soc. 2022 Mar 23;144(11):4819-4827. doi: 10.1021/jacs.1c12005. Epub 2022 Mar 8.
Applications of machine learning (ML) to synthetic chemistry rely on the assumption that large numbers of literature-reported examples should enable construction of accurate and predictive models of chemical reactivity. This paper demonstrates that abundance of carefully curated literature data may be insufficient for this purpose. Using an example of Suzuki-Miyaura coupling with heterocyclic building blocks, and a carefully selected database of >10,000 literature examples, we show that ML models cannot offer any meaningful predictions of optimum reaction conditions, even if the search space is restricted to only solvents and bases. This result holds irrespective of the ML model applied (from simple feed-forward to state-of-the-art graph-convolution neural networks) or the representation to describe the reaction partners (various fingerprints, chemical descriptors, latent representations, etc.). In all cases, the ML methods fail to perform significantly better than naive assignments based on the sheer frequency of certain reaction conditions reported in the literature. These unsatisfactory results likely reflect subjective preferences of various chemists to use certain protocols, other biasing factors as mundane as availability of certain solvents/reagents, and/or a lack of negative data. These findings highlight the likely importance of systematically generating reliable and standardized data sets for algorithm training.
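The "naive assignment based on sheer frequency" mentioned above can be illustrated with a minimal sketch. The abstract does not describe the exact baseline, so the function names, the top-k scoring, and the toy solvent/base pairs below are all assumptions made for illustration; none of the values come from the paper's data set.

```python
from collections import Counter


def frequency_baseline(train_labels, k=1):
    """Return the k most frequently reported condition labels in the training set."""
    return [label for label, _ in Counter(train_labels).most_common(k)]


def top_k_accuracy(predicted_labels, true_labels):
    """Fraction of test reactions whose reported condition is among the predicted labels."""
    hits = sum(1 for label in true_labels if label in predicted_labels)
    return hits / len(true_labels)


# Hypothetical toy data: each label is a (solvent, base) pair reported for one reaction.
train = (
    [("dioxane", "K2CO3")] * 5
    + [("DMF", "Cs2CO3")] * 3
    + [("toluene", "K3PO4")] * 2
)
test = [
    ("dioxane", "K2CO3"),
    ("DMF", "Cs2CO3"),
    ("dioxane", "K2CO3"),
    ("toluene", "K3PO4"),
]

# The baseline ignores the substrates entirely and always proposes the most popular conditions.
top1 = frequency_baseline(train, k=1)
top2 = frequency_baseline(train, k=2)

print(top_k_accuracy(top1, test))  # 0.5
print(top_k_accuracy(top2, test))  # 0.75
```

A trained model is only informative if its top-k accuracy clearly exceeds what this kind of substrate-blind popularity ranking achieves on the same test split.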