DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature.

Author information

Rajan Kohulan, Brinkhaus Henning Otto, Sorokina Maria, Zielesny Achim, Steinbeck Christoph

Affiliations

Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany.

Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany.

Publication information

J Cheminform. 2021 Mar 8;13(1):20. doi: 10.1186/s13321-021-00496-1.

DOI:10.1186/s13321-021-00496-1

PMID:33685498

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7941967/

Abstract

Chemistry looks back at many decades of publications on chemical compounds, their structures and properties, in scientific articles. Liberating this knowledge (semi-)automatically and making it available to the world in open-access databases is a current challenge. Apart from mining textual information, Optical Chemical Structure Recognition (OCSR), the translation of an image of a chemical structure into a machine-readable representation, is part of this workflow. As the OCSR process requires an image containing a chemical structure, there is a need for a publicly available tool that automatically recognizes and segments chemical structure depictions from scientific publications. This is especially important for older documents which are only available as scanned pages. Here, we present DECIMER (Deep lEarning for Chemical IMagE Recognition) Segmentation, the first open-source, deep learning-based tool for automated recognition and segmentation of chemical structures from the scientific literature. The workflow is divided into two main stages. During the detection step, a deep learning model recognizes chemical structure depictions and creates masks which define their positions on the input page. Subsequently, potentially incomplete masks are expanded in a post-processing workflow. The performance of DECIMER Segmentation has been manually evaluated on three sets of publications from different publishers. The approach operates on bitmap images of journal pages to be applicable also to older articles before the introduction of vector images in PDFs. By making the source code and the trained model publicly available, we hope to contribute to the development of comprehensive chemical data extraction workflows. In order to facilitate access to DECIMER Segmentation, we also developed a web application. The web application, available at https://decimer.ai, lets the user upload a PDF file and retrieve the segmented structure depictions.
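The second stage described above, expanding a potentially incomplete mask, can be illustrated with a small sketch. This is not the authors' implementation: it assumes the page has already been binarized into a grid of 0/1 ink pixels, and it swaps in a plain breadth-first flood fill as a stand-in for the published post-processing workflow. The names `expand_mask`, `bounding_box`, `page`, and `mask` are hypothetical and chosen only for this illustration.

```python
from collections import deque


def expand_mask(page, mask):
    """Grow `mask` over every ink pixel connected to ink it already covers.

    page: 2D list of 0/1, where 1 marks a dark (ink) pixel of a binarized page.
    mask: 2D list of 0/1 with the same shape, where 1 marks a detected pixel.
    Returns a new 2D list; neither input is modified.
    """
    height, width = len(page), len(page[0])
    expanded = [row[:] for row in mask]
    # Seed the search with ink pixels that the initial mask already covers.
    queue = deque(
        (r, c)
        for r in range(height)
        for c in range(width)
        if mask[r][c] and page[r][c]
    )
    while queue:
        r, c = queue.popleft()
        # Visit the 8-connected neighbourhood of the current pixel.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < height and 0 <= nc < width
                if inside and page[nr][nc] and not expanded[nr][nc]:
                    expanded[nr][nc] = 1
                    queue.append((nr, nc))
    return expanded


def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, or None for an empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for row in mask for c, value in enumerate(row) if value]
    return rows[0], min(cols), rows[-1], max(cols)


# Toy page: a closed ring (columns 1-4) and an unrelated stroke (column 6).
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 1],
    [0, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
# Incomplete detection: the mask covers only the left half of the ring.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
expanded = expand_mask(page, mask)
print(bounding_box(mask))      # (1, 1, 3, 2)
print(bounding_box(expanded))  # (1, 1, 3, 4)
```

In this toy case the expanded mask grows to cover the whole ring but leaves the disconnected stroke in column 6 untouched, which is the behaviour one would want when a structure sits next to unrelated text. A real page would need additional handling (for example, for lines or labels that touch the structure) that this sketch does not attempt.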

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0023/7941967/4c7d6794ece4/13321_2021_496_Fig1_HTML.jpg

Similar articles

DECIMER-Segmentation: Automated extraction of chemical structure depictions from scientific literature.

J Cheminform. 2021 Mar 8;13(1):20. doi: 10.1186/s13321-021-00496-1.

DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications.

Nat Commun. 2023 Aug 19;14(1):5045. doi: 10.1038/s41467-023-40782-0.

DECIMER 1.0: deep learning for chemical image recognition using transformers.

J Cheminform. 2021 Aug 17;13(1):61. doi: 10.1186/s13321-021-00538-8.

A review of optical chemical structure recognition tools.

J Cheminform. 2020 Oct 7;12(1):60. doi: 10.1186/s13321-020-00465-0.

DECIMER: towards deep learning for chemical image recognition.

J Cheminform. 2020 Oct 27;12(1):65. doi: 10.1186/s13321-020-00469-w.

Advancements in hand-drawn chemical structure recognition through an enhanced DECIMER architecture.

J Cheminform. 2024 Jul 5;16(1):78. doi: 10.1186/s13321-024-00872-7.

DECIMER-hand-drawn molecule images dataset.

J Cheminform. 2022 Jun 9;14(1):36. doi: 10.1186/s13321-022-00620-9.

Automated molecular structure segmentation from documents using ChemSAM.

J Cheminform. 2024 Mar 12;16(1):29. doi: 10.1186/s13321-024-00823-2.

YoDe-Segmentation: automated noise-free retrieval of molecular structures from scientific publications.

J Cheminform. 2023 Nov 20;15(1):111. doi: 10.1186/s13321-023-00783-z.

MolMiner: You Only Look Once for Chemical Structure Recognition.

J Chem Inf Model. 2022 Nov 28;62(22):5321-5328. doi: 10.1021/acs.jcim.2c00733. Epub 2022 Sep 15.

Cited by

Role of Artificial Intelligence in Drug Discovery to Revolutionize the Pharmaceutical Industry: Resources, Methods and Applications.

Recent Pat Biotechnol. 2025;19(1):35-52. doi: 10.2174/0118722083297406240313090140.

Revealing Chemical Trends: Insights from Data-Driven Visualization and Patent Analysis in Exposomics Research.

Environ Sci Technol Lett. 2024 Aug 30;11(10):1046-1052. doi: 10.1021/acs.estlett.4c00560. eCollection 2024 Oct 8.

Automation and machine learning augmented by large language models in a catalysis study.

Chem Sci. 2024 Jun 26;15(31):12200-12233. doi: 10.1039/d3sc07012c. eCollection 2024 Aug 7.

PatCID: an open-access dataset of chemical structures in patent documents.

Nat Commun. 2024 Aug 2;15(1):6532. doi: 10.1038/s41467-024-50779-y.

Automated molecular structure segmentation from documents using ChemSAM.

J Cheminform. 2024 Mar 12;16(1):29. doi: 10.1186/s13321-024-00823-2.

YoDe-Segmentation: automated noise-free retrieval of molecular structures from scientific publications.

J Cheminform. 2023 Nov 20;15(1):111. doi: 10.1186/s13321-023-00783-z.

Cheminformatics Microservice: unifying access to open cheminformatics toolkits.

J Cheminform. 2023 Oct 16;15(1):98. doi: 10.1186/s13321-023-00762-4.

Artificial intelligence for natural product drug discovery.

Nat Rev Drug Discov. 2023 Nov;22(11):895-916. doi: 10.1038/s41573-023-00774-7. Epub 2023 Sep 11.

DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications.

Nat Commun. 2023 Aug 19;14(1):5045. doi: 10.1038/s41467-023-40782-0.

Review of techniques and models used in optical chemical structure recognition in images and scanned documents.

J Cheminform. 2022 Sep 9;14(1):61. doi: 10.1186/s13321-022-00642-3.

References

A review of optical chemical structure recognition tools.

J Cheminform. 2020 Oct 7;12(1):60. doi: 10.1186/s13321-020-00465-0.

DECIMER: towards deep learning for chemical image recognition.

J Cheminform. 2020 Oct 27;12(1):65. doi: 10.1186/s13321-020-00469-w.

ChemSchematicResolver: A Toolkit to Decode 2D Chemical Diagrams with Labels and R-Groups into Annotated Chemical Named Entities.

J Chem Inf Model. 2020 Apr 27;60(4):2059-2072. doi: 10.1021/acs.jcim.0c00042. Epub 2020 Apr 7.

Molecular Structure Extraction from Documents Using Deep Learning.

J Chem Inf Model. 2019 Mar 25;59(3):1017-1029. doi: 10.1021/acs.jcim.8b00669. Epub 2019 Feb 27.

Information Retrieval and Text Mining Technologies for Chemistry.

Chem Rev. 2017 Jun 28;117(12):7673-7761. doi: 10.1021/acs.chemrev.6b00851. Epub 2017 May 5.

ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature.

J Chem Inf Model. 2016 Oct 24;56(10):1894-1904. doi: 10.1021/acs.jcim.6b00207. Epub 2016 Oct 6.

Open Data, Open Source and Open Standards in chemistry: The Blue Obelisk five years on.

J Cheminform. 2011 Oct 14;3(1):37. doi: 10.1186/1758-2946-3-37.

Optical structure recognition software to recover chemical information: OSRA, an open source solution.

J Chem Inf Model. 2009 Mar;49(3):740-3. doi: 10.1021/ci800067r.
