Jin Tianfan, Zhao Qiyuan, Schofield Andrew B, Savoie Brett M
Department of Chemical Engineering, Purdue University, West Lafayette, USA
Chem Sci. 2024 Jul 1;15(30):11995-12005. doi: 10.1039/d3sc04909d. eCollection 2024 Jul 31.
Deductive solution strategies are required in prediction scenarios that are underdetermined, when contradictory information is available, or more generally wherever one-to-many, non-functional mappings occur. In contrast, most contemporary machine learning (ML) in the chemical sciences is inductive: models learn from examples using a fixed set of features. Chemical workflows are replete with situations requiring deduction, including many aspects of lab automation and spectral interpretation. Here, a general strategy is described for designing and training machine learning models capable of deduction, which consists of combining individual inductive models into a larger deductive network. The training and testing of these models are demonstrated on the task of deducing reaction products from a mixture of spectral sources. The resulting models can distinguish between intended and unintended reaction outcomes and identify starting material based on a mixture of spectral sources. The models also perform well on tasks that they were not directly trained on, such as performing structural inference using real rather than simulated spectral inputs, predicting minor products from named organic chemistry reactions, identifying reagents and isomers as plausible impurities, and handling missing or conflicting information. A new dataset of 1,124,043 simulated spectra that were generated to train these models is also distributed with this work. These findings demonstrate that deductive bottlenecks for chemical problems are not fundamentally insuperable for ML models.