Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
J Chem Inf Model. 2023 Jul 10;63(13):4030-4041. doi: 10.1021/acs.jcim.3c00439. Epub 2023 Jun 27.
Reaction diagram parsing is the task of extracting reaction schemes from a diagram in the chemistry literature. The reaction diagrams can be arbitrarily complex; thus, robustly parsing them into structured data is an open challenge. In this paper, we present RxnScribe, a machine learning model for parsing reaction diagrams of varying styles. We formulate this structured prediction task with a sequence generation approach, which condenses the traditional pipeline into an end-to-end model. We train RxnScribe on a dataset of 1378 diagrams and evaluate it with cross validation, achieving an 80.0% soft match F1 score, with significant improvements over previous models. Our code and data are publicly available at https://github.com/thomas0809/RxnScribe.
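The F1 score reported above combines precision and recall over predicted versus ground-truth reactions. A minimal sketch of such a computation follows. The abstract does not define the "soft match" criterion, so the `Reaction` type, the `same_molecules` predicate, and the greedy one-to-one matching are illustrative assumptions, not the RxnScribe evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Sequence


@dataclass(frozen=True)
class Reaction:
    # Hypothetical representation: each role holds a set of entity labels.
    reactants: FrozenSet[str]
    conditions: FrozenSet[str]
    products: FrozenSet[str]


def same_molecules(pred: Reaction, gold: Reaction) -> bool:
    # Assumed lenient criterion: compare reactants and products only,
    # ignoring conditions. The paper's actual definition may differ.
    return pred.reactants == gold.reactants and pred.products == gold.products


def f1_score(
    predicted: Sequence[Reaction],
    gold: Sequence[Reaction],
    is_match: Callable[[Reaction, Reaction], bool],
) -> float:
    # Greedy one-to-one matching: each gold reaction is matched at most once.
    unmatched_gold = list(gold)
    true_positives = 0
    for pred in predicted:
        for i, candidate in enumerate(unmatched_gold):
            if is_match(pred, candidate):
                true_positives += 1
                del unmatched_gold[i]
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Passing the matching predicate as an argument keeps the metric independent of how strict the comparison is, so a stricter criterion can be swapped in without changing the scoring logic.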