Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
J Chem Inf Model. 2024 Jul 22;64(14):5521-5534. doi: 10.1021/acs.jcim.4c00572. Epub 2024 Jul 1.
Information extraction from chemistry literature is vital for constructing up-to-date reaction databases for data-driven chemistry. Complete extraction requires combining information across text, tables, and figures, whereas prior work has mainly investigated extracting reactions from single modalities. In this paper, we present OpenChemIE to address this complex challenge and enable the extraction of reaction data at the document level. OpenChemIE approaches the problem in two steps: extracting relevant information from individual modalities and then integrating the results to obtain a final list of reactions. For the first step, we employ specialized neural models that each address a specific task for chemistry information extraction, such as parsing molecules or reactions from text or figures. We then integrate the information from these modules using chemistry-informed algorithms, allowing for the extraction of fine-grained reaction data from reaction condition and substrate scope investigations. Our machine learning models attain state-of-the-art performance when evaluated individually, and we meticulously annotate a challenging dataset of reaction schemes with R-groups to evaluate our pipeline as a whole, achieving an F1 score of 69.5%. Additionally, the reaction extraction results of OpenChemIE attain an accuracy score of 64.3% when directly compared against the Reaxys chemical database. OpenChemIE is most suited for information extraction on organic chemistry literature, where molecules are generally depicted as planar graphs or written in text and can be consolidated into a SMILES format. We provide OpenChemIE freely to the public as an open-source package, as well as through a web interface.
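As a rough illustration of the kind of R-group resolution and set-based scoring the abstract describes, the sketch below expands a reaction template over a substrate-scope table and computes an F1 score against a reference set. It is not OpenChemIE's code or API. The names Reaction, expand_r_groups, and f1_score, the "[R]" placeholder, and the compound labels are all hypothetical, and plain string substitution stands in for the chemistry-informed algorithms mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A reaction as tuples of SMILES strings (hypothetical container)."""
    reactants: tuple[str, ...]
    products: tuple[str, ...]


def expand_r_groups(template: Reaction, substituents: dict[str, str],
                    placeholder: str = "[R]") -> dict[str, Reaction]:
    """Fill the placeholder in every SMILES string with each substituent.

    Assumption: the placeholder sits at the start of the SMILES string and
    each fragment is written so that its attachment atom comes last.
    """
    expanded = {}
    for label, fragment in substituents.items():
        expanded[label] = Reaction(
            reactants=tuple(s.replace(placeholder, fragment) for s in template.reactants),
            products=tuple(s.replace(placeholder, fragment) for s in template.products),
        )
    return expanded


def f1_score(predicted: set, gold: set) -> float:
    """Harmonic mean of precision and recall over exact-match reactions."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative template: aryl bromide + phenylboronic acid -> biaryl.
template = Reaction(
    reactants=("[R]c1ccc(Br)cc1", "OB(O)c1ccccc1"),
    products=("[R]c1ccc(-c2ccccc2)cc1",),
)
# Hypothetical scope table: compound label -> R-group fragment.
scope = expand_r_groups(template, {"3a": "C", "3b": "F"})
```

Exact string matching is used here only to keep the sketch short. A realistic comparison would canonicalize each SMILES string first, since one molecule can be written in many equivalent ways.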