Review of techniques and models used in optical chemical structure recognition in images and scanned documents.

Authors

Musazade Fidan, Jamalova Narmin, Hasanov Jamaladdin

Affiliations

School of Engineering and Applied Science, The George Washington University, Washington, DC, United States.

School of IT and Engineering, ADA University, Baku, Azerbaijan.

Publication information

J Cheminform. 2022 Sep 9;14(1):61. doi: 10.1186/s13321-022-00642-3.

DOI:10.1186/s13321-022-00642-3
PMID:36076301
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9461257/
Abstract

For some time, extracting chemical formulas from images was not a top priority among Computer Vision tasks. Complexity on both the input and the prediction side has made the task challenging for conventional Artificial Intelligence and Machine Learning approaches. A binary input image, which might seem trivial for convolutional analysis, was not easy to classify, because a single sample is not representative of its molecule: the same formula can be drawn in a variety of graphical representations that do not resemble each other. Given the variety of molecules, the problem shifted from classification to formula generation, which makes Natural Language Processing (NLP) a good candidate for an effective solution. This paper describes the evolution of approaches from rule-based structure analysis to complex statistical models, and compares the efficiency of the models and methodologies used in recent years. Although the latest achievements deliver ideal results on particular datasets, the authors point out possible problems in various scenarios and offer suggestions for further development.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/c73f4cbc8e52/13321_2022_642_Fig1_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/c78efbe5baab/13321_2022_642_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/2eed82e0faad/13321_2022_642_Fig3_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/ccb60e3aaa19/13321_2022_642_Fig4_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/7b6673cde04b/13321_2022_642_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/fe7be87e32e0/13321_2022_642_Fig6_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/63443c5a7204/13321_2022_642_Fig7_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/2c3ffba8521c/13321_2022_642_Fig8_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/38a330e88fd4/13321_2022_642_Fig9_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/15b9/9461257/a9ebb3cf6a45/13321_2022_642_Fig10_HTML.jpg
