Infinite Intelligence Pharma, Beijing, China 100083.
Center for Quantitative Biology, Peking University, Beijing, China 100871.
J Chem Inf Model. 2022 Nov 28;62(22):5321-5328. doi: 10.1021/acs.jcim.2c00733. Epub 2022 Sep 15.
Molecular structures are commonly depicted as 2D drawings in scientific documents such as journal papers and patents. However, these 2D depictions are not machine readable. Because of a decades-long backlog and a growing volume of printed literature, there is high demand for translating printed depictions into machine-readable formats, a task known as Optical Chemical Structure Recognition (OCSR). Most OCSR systems developed over the last three decades use a rule-based approach, which vectorizes the depiction and interprets the resulting vectors and nodes as bonds and atoms. Here, we present a practical software tool called MolMiner, which is built primarily on deep neural networks originally developed for semantic segmentation and object detection, and which uses them to recognize atom and bond elements in documents. The recognized elements are then connected into a molecular graph by a distance-based construction algorithm. MolMiner achieved state-of-the-art performance on four benchmark data sets and on a self-collected external data set from scientific papers. Because MolMiner performed similarly well in real-world OCSR tasks and provides a user-friendly interface, it is a useful tool for daily applications. Free download links for the Mac and Windows versions are available at https://github.com/iipharma/pharmamind-molminer.
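The abstract does not spell out the distance-based construction step, so the following is only a minimal illustrative sketch of the general idea, not MolMiner's actual implementation. It assumes that detection yields atom centers and bond segments in pixel coordinates, and it attaches each bond endpoint to the nearest atom within a cutoff. The names `Atom`, `Bond`, `nearest_atom`, `build_graph`, and the `max_dist` parameter are all hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Atom:
    """Hypothetical detected atom: element symbol and center in pixels."""
    symbol: str
    x: float
    y: float


@dataclass(frozen=True)
class Bond:
    """Hypothetical detected bond: two endpoints in pixels and a bond order."""
    x1: float
    y1: float
    x2: float
    y2: float
    order: int = 1


def nearest_atom(x, y, atoms, max_dist):
    """Return the index of the atom closest to (x, y), or None if none is within max_dist."""
    best_idx, best_d = None, max_dist
    for i, atom in enumerate(atoms):
        d = hypot(atom.x - x, atom.y - y)
        if d <= best_d:
            best_idx, best_d = i, d
    return best_idx


def build_graph(atoms, bonds, max_dist=10.0):
    """Connect each bond to the atoms nearest its two endpoints.

    Returns a list of (atom_index_a, atom_index_b, order) edges.
    Bonds whose endpoints cannot be matched to two distinct atoms are skipped.
    """
    edges = []
    for bond in bonds:
        i = nearest_atom(bond.x1, bond.y1, atoms, max_dist)
        j = nearest_atom(bond.x2, bond.y2, atoms, max_dist)
        if i is None or j is None or i == j:
            continue  # unmatched or degenerate bond: ignore in this sketch
        edges.append((min(i, j), max(i, j), bond.order))
    return edges


# Toy input: three atoms in a row (C, C, O) joined by two single bonds.
atoms = [Atom("C", 0.0, 0.0), Atom("C", 20.0, 0.0), Atom("O", 40.0, 0.0)]
bonds = [Bond(2.0, 0.0, 18.0, 0.0), Bond(22.0, 0.0, 38.0, 0.0)]
edges = build_graph(atoms, bonds)
```

In this toy case each bond endpoint lies 2 pixels from its intended atom, so the two bonds become edges between atoms 0 and 1 and between atoms 1 and 2. A real system would also need to handle crossing bonds, abbreviated groups, and a cutoff scaled to the drawing's bond length, none of which this sketch attempts.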