Rajan Kohulan, Zielesny Achim, Steinbeck Christoph
Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessingstr. 8, 07743, Jena, Germany.
Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665, Recklinghausen, Germany.
J Cheminform. 2020 Oct 27;12(1):65. doi: 10.1186/s13321-020-00469-w.
The automatic recognition of chemical structure diagrams from the literature is an indispensable component of workflows to re-discover information about chemicals and to make it available in open-access databases. Here we report preliminary findings in our development of Deep lEarning for Chemical ImagE Recognition (DECIMER), a deep learning method based on existing show-and-tell deep neural networks, which makes very few assumptions about the structure of the underlying problem. It translates a bitmap image of a molecule, as found in publications, into a SMILES string. The training state reported here does not yet rival the performance of existing traditional approaches, but we present evidence that our method will reach comparable detection power with sufficient training time. The training success of DECIMER depends on the input data representation: DeepSMILES is superior to SMILES, and we have a preliminary indication that the recently reported SELFIES outperforms DeepSMILES. An extrapolation of our results towards larger training data sizes suggests that we might be able to achieve near-accurate prediction with 50 to 100 million training structures. This work is entirely based on open-source software and open data and is available to the general public for any purpose.
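Because the method treats structure recognition as image captioning, each target string has to be split into tokens before a show-and-tell style decoder can learn to emit it. The following is a minimal sketch of one way this could be done for SMILES, not the authors' implementation: the regular expression, the `<start>` and `<end>` markers, and the function names `tokenize_smiles`, `build_vocab`, and `encode` are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern covering common SMILES syntax (an assumption,
# not taken from the DECIMER code base).
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"        # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"            # two-letter organic-subset atoms
    r"|%\d{2}"           # two-digit ring-closure labels such as %10
    r"|[BCNOPSFI]"       # one-letter aliphatic atoms
    r"|[bcnops]"         # aromatic atoms
    r"|[=#$:/\\.\-+()]"  # bonds, branch parentheses, disconnection dot
    r"|\d"               # single-digit ring-closure labels
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens framed by start/end markers."""
    tokens = SMILES_TOKEN.findall(smiles)
    # findall silently skips unmatched characters, so verify full coverage.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in SMILES: {smiles!r}")
    return ["<start>"] + tokens + ["<end>"]


def build_vocab(smiles_list):
    """Map every token seen in the training strings to an integer id."""
    seen = set()
    for smiles in smiles_list:
        seen.update(tokenize_smiles(smiles))
    return {token: index for index, token in enumerate(sorted(seen))}


def encode(smiles, vocab):
    """Turn one SMILES string into the id sequence a decoder would predict."""
    return [vocab[token] for token in tokenize_smiles(smiles)]


if __name__ == "__main__":
    training = ["CCO", "c1ccccc1", "CC(=O)Cl"]
    vocab = build_vocab(training)
    print(tokenize_smiles("CC(=O)Cl"))
    print(encode("CCO", vocab))
```

The same framing would apply to DeepSMILES or SELFIES targets with a different token pattern; only the string representation changes, not the captioning setup.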