LBM, Département de Chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France.
PASTEUR, Département de Chimie, École Normale Supérieure, PSL University, Sorbonne Université, CNRS, 75005 Paris, France.
J Am Chem Soc. 2022 Aug 17;144(32):14722-14730. doi: 10.1021/jacs.2c05302. Epub 2022 Aug 8.
Synthetic yield prediction using machine learning is intensively studied. Previous work has focused on two categories of data sets: high-throughput experimentation data, as an ideal case study, and data sets extracted from proprietary databases, which are known to have a strong reporting bias toward high yields. However, predicting yields using published reaction data remains elusive. To fill this gap, we built a data set on nickel-catalyzed cross-couplings extracted from organic reaction publications, including scope and optimization information. We demonstrate the importance of including optimization data as a source of failed experiments and emphasize how publication constraints shape the exploration of chemical space by the synthetic community. While machine learning models still fail to perform out-of-sample predictions, this work shows that adding chemical knowledge enables fair predictions in a low-data regime. Ultimately, we hope that this unique public database will foster further improvements in machine learning methods for reaction yield prediction in a more realistic context.