The challenge of balancing model sensitivity and robustness in predicting yields: a benchmarking study of amide coupling reactions.

Authors

Liu Zhen, Moroz Yurii S, Isayev Olexandr

Affiliations

Department of Chemistry, Mellon College of Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA.

Enamine Ltd, Kyïv 02660, Ukraine.

Publication information

Chem Sci. 2023 Sep 13;14(39):10835-10846. doi: 10.1039/d3sc03902a. eCollection 2023 Oct 11.

Abstract

Accurate prediction of reaction yield is the holy grail for computer-assisted synthesis prediction, but current models have failed to generalize to large literature datasets. To understand the causes and inspire future design, we systematically benchmarked the yield prediction task. We carefully curated and augmented a literature dataset of 41 239 amide coupling reactions, each with information on reactants, products, intermediates, yields, and reaction contexts, and provided 3D structures for the molecules. We calculated molecular features related to 2D and 3D structure information, as well as physical and electronic properties. These descriptors were paired with 4 categories of machine learning methods (linear, kernel, ensemble, and neural network), yielding valuable benchmarks about feature and model performance. Despite the excellent performance on a high-throughput experiment (HTE) dataset (R² around 0.9), no method gave satisfactory results on the literature data. The best performance was an R² of 0.395 ± 0.020 using the stack technique. Error analysis revealed that reactivity cliffs and yield uncertainty are among the main reasons for incorrect predictions. Removing reactivity cliffs and uncertain reactions boosted the R² to 0.457 ± 0.006. These results highlight that yield prediction models must be sensitive to reactivity changes caused by subtle structural variations, as well as robust to the uncertainty associated with yield measurements.
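
To make the reported scores easier to interpret, here is a minimal sketch of how such a score could be computed, assuming the values quoted above (around 0.9, 0.395, 0.457) are coefficients of determination, R² = 1 − SS_res / SS_tot. The function name `r2_score` and the yield values are hypothetical and invented for illustration; they are not taken from the paper or its dataset.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    y_bar = mean(y_true)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean yield.
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical yields in percent (illustrative only, not from the paper).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(round(r2_score(observed, predicted), 3))  # prints 0.948
```

Under this definition a score of 1 means perfect prediction and a score of 0 means the model does no better than always predicting the mean yield, so a value near 0.4 on literature data explains well under half of the observed variance.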

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5541/10566507/8cb487dc4d73/d3sc03902a-f1.jpg