Rajan Kohulan, Brinkhaus Henning Otto, Sorokina Maria, Zielesny Achim, Steinbeck Christoph
Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany.
Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany.
J Cheminform. 2021 Mar 8;13(1):20. doi: 10.1186/s13321-021-00496-1.
Chemistry can look back on many decades of scientific articles describing chemical compounds, their structures and their properties. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. Because the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions in scientific publications. This is especially important for older documents that are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. In the detection stage, a deep learning model recognizes chemical structure depictions and creates masks that define their positions on the input page. In the subsequent post-processing stage, potentially incomplete masks are expanded. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages so that it is also applicable to older articles published before vector images were introduced in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. To facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.
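The post-processing idea described above, expanding a potentially incomplete mask so that it covers the whole depiction, can be illustrated with a small sketch. This is not the DECIMER Segmentation implementation; it is a minimal region-growing example that assumes the page has already been binarized into an ink/background array and that a detection model has supplied a partial mask. The function names `expand_mask` and `bounding_box` are hypothetical and introduced only for illustration.

```python
from collections import deque

import numpy as np


def expand_mask(ink, mask):
    """Grow `mask` over every ink pixel that is 8-connected to ink inside it.

    Hypothetical helper for illustration only: `ink` is a 2D array that is
    truthy where the page is dark, `mask` a 2D array that is truthy where
    a detector marked a structure.
    """
    ink = np.asarray(ink, dtype=bool)
    expanded = np.asarray(mask, dtype=bool).copy()
    height, width = ink.shape

    # Seeds are the ink pixels the incomplete mask already covers.
    visited = ink & expanded
    queue = deque((int(r), int(c)) for r, c in np.argwhere(visited))

    # Breadth-first flood fill along connected ink pixels.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < height and 0 <= nc < width
                if inside and ink[nr, nc] and not visited[nr, nc]:
                    visited[nr, nc] = True
                    expanded[nr, nc] = True
                    queue.append((nr, nc))
    return expanded


def bounding_box(mask):
    """Return (top, bottom, left, right) of a non-empty mask, end-exclusive."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1


# Toy page: one long stroke that the mask covers only partly, plus an
# unrelated ink pixel elsewhere on the page.
page_ink = np.zeros((6, 8), dtype=bool)
page_ink[2, 1:7] = True
page_ink[5, 0] = True

partial_mask = np.zeros((6, 8), dtype=bool)
partial_mask[1:4, 1:4] = True

full_mask = expand_mask(page_ink, partial_mask)
top, bottom, left, right = bounding_box(full_mask)
segment = page_ink[top:bottom, left:right]  # cropped structure depiction
```

In this toy case the mask grows along the stroke it only partly covered, while the unconnected pixel in the corner is left out. A real page would additionally need binarization and some tolerance for small gaps between lines, which this sketch does not attempt.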