Information School, University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom.
Evotec (U.K.) Ltd., 114 Innovation Drive, Milton Park, Abingdon OX14 4RZ, United Kingdom.
J Chem Inf Model. 2019 Oct 28;59(10):4167-4187. doi: 10.1021/acs.jcim.9b00537. Epub 2019 Sep 26.
Reaction classification is often considered an important task for many different applications and has traditionally been accomplished using hand-coded, rule-based approaches. However, the availability of large reaction collections enables data-driven approaches to be developed. We present the development and validation of a 336-class machine learning-based classification model integrated within a Conformal Prediction (CP) framework, which associates reaction class predictions with confidence estimations. We also propose a data-driven approach for "dynamic" reaction fingerprinting to maximize the effectiveness of reaction encoding, and we develop a novel reaction classification system that organizes labels into four hierarchical levels (SHREC: Sheffield Hierarchical REaction Classification). We show that the performance of the CP-augmented model can be improved by defining confidence thresholds to detect predictions that are less likely to be false. For example, in the external validation of the model, 95% of the retained predictions are correct after filtering out the uncertain classifications, which make up less than 15% of the total. The application of the model is demonstrated by classifying two reaction data sets: one extracted from an industrial electronic laboratory notebook (ELN) and the other from the medicinal chemistry literature. We show how confidence estimations and class compositions across different levels of information can be used to gain immediate insights into the nature of reaction collections and hidden relationships between reaction classes.
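The idea of attaching a confidence estimate to each class prediction and then filtering by a threshold can be illustrated with a minimal sketch of inductive conformal prediction for classification. This is not the authors' implementation: the abstract does not specify the nonconformity measure or the CP variant, so the sketch assumes the common choice of nonconformity = 1 minus the predicted probability of the candidate class, and all function names (`calibration_scores`, `p_values`, `predict_with_confidence`, `keep_confident`) are hypothetical.

```python
from typing import List, Sequence, Tuple


def calibration_scores(probs: Sequence[Sequence[float]],
                       labels: Sequence[int]) -> List[float]:
    """Nonconformity of each calibration example: 1 - P(true class).

    Assumption: this is one common nonconformity measure, not
    necessarily the one used in the paper.
    """
    return [1.0 - p[y] for p, y in zip(probs, labels)]


def p_values(cal_scores: Sequence[float],
             probs: Sequence[float]) -> List[float]:
    """Conformal p-value for every candidate class of one test object."""
    n = len(cal_scores)
    out = []
    for p in probs:
        score = 1.0 - p
        # Count calibration examples at least as nonconforming as this one.
        at_least = sum(1 for s in cal_scores if s >= score)
        out.append((at_least + 1) / (n + 1))
    return out


def predict_with_confidence(cal_scores: Sequence[float],
                            probs: Sequence[float]) -> Tuple[int, float, float]:
    """Return (predicted class, confidence, credibility).

    confidence  = 1 - second-largest p-value
    credibility = largest p-value
    """
    pv = p_values(cal_scores, probs)
    ranked = sorted(range(len(pv)), key=lambda k: pv[k], reverse=True)
    best = ranked[0]
    second = pv[ranked[1]] if len(ranked) > 1 else 0.0
    return best, 1.0 - second, pv[best]


def keep_confident(preds: Sequence[Tuple[int, float, float]],
                   threshold: float) -> List[Tuple[int, float, float]]:
    """Retain only predictions whose confidence meets the threshold."""
    return [p for p in preds if p[1] >= threshold]


if __name__ == "__main__":
    # Toy calibration scores (illustrative values, not data from the paper).
    cal = [0.05, 0.1, 0.2, 0.3, 0.62, 0.7, 0.8, 0.15, 0.25]
    preds = [
        predict_with_confidence(cal, [0.92, 0.05, 0.03]),
        predict_with_confidence(cal, [0.45, 0.40, 0.15]),
    ]
    for cls, conf, cred in keep_confident(preds, threshold=0.85):
        print(f"class={cls} confidence={conf:.2f} credibility={cred:.2f}")
```

Raising the threshold trades coverage for accuracy: fewer predictions are retained, but those that remain are less likely to be wrong, which is the trade-off the abstract quantifies for the external validation set.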