Zhou Chong, Liu Wei, Song Xiyue, Yang Mengling, Peng Xiaowang
School of Informatics, Hunan University of Chinese Medicine, Changsha, 410208, Hunan, People's Republic of China.
J Cheminform. 2023 Nov 20;15(1):111. doi: 10.1186/s13321-023-00783-z.
In chemistry-related disciplines, a vast repository of molecular structural data has been documented in scientific publications but remains inaccessible to computational analyses owing to its non-machine-readable format. Optical chemical structure recognition (OCSR) addresses this gap by converting images of chemical molecular structures into a machine-readable format that is convenient to store, paving the way for further analysis and study of chemical information. A pivotal initial step in OCSR is the automated, noise-free extraction of molecular descriptions from the literature. Despite efforts applying rule-based and deep learning approaches to this extraction step, the accuracy achieved to date remains unsatisfactory. To address this issue, we introduce a deep learning model named YoDe-Segmentation, engineered for the automated retrieval of molecular structures from scientific documents. The model operates via a three-stage process of detection, mask generation, and calculation. First, it identifies and isolates molecular structures in the detection stage. Next, mask maps are generated from these isolated structures in the mask generation stage. In the final calculation stage, the refined and separated mask maps are combined with the isolated molecular structure images to yield pure molecular structures. The model was rigorously tested on texts from multiple chemistry-centric journals, with the outcomes validated manually. The results show that YoDe-Segmentation outperforms alternative algorithms, achieving an average extraction efficiency of 97.62%. This outcome highlights the robustness and reliability of the model and suggests its applicability on a broad scale.
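The abstract does not include code, so the sketch below is only an illustration of the idea behind the final calculation stage: combining a binary segmentation mask with the cropped detection so that only molecular-structure pixels are kept. The function name, array shapes, and mask convention (1 = molecule, 0 = background) are assumptions, not the authors' implementation.

import numpy as np

def apply_mask(crop: np.ndarray, mask: np.ndarray, background: int = 255) -> np.ndarray:
    """Keep pixels of `crop` where `mask` marks the molecule; fill the rest with white.

    crop: HxW grayscale image of the detected region (uint8).
    mask: HxW binary mask from the segmentation stage (1 = molecule, 0 = background).
    """
    assert crop.shape == mask.shape, "mask must align with the detected crop"
    clean = np.full_like(crop, background)               # blank (white) canvas
    keep = mask.astype(bool)
    clean[keep] = crop[keep]                              # copy only molecule pixels
    return clean

if __name__ == "__main__":
    # Toy usage with synthetic data: a random crop and a square "molecule" region.
    crop = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    mask = np.zeros((64, 64), dtype=np.uint8)
    mask[16:48, 16:48] = 1
    pure = apply_mask(crop, mask)
    print(pure.shape, pure.dtype)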