Karthikeyan Muthukumarasamy, Vyas Renu
Chemical Engineering and Process Development (CEPD), CSIR-National Chemical Laboratory, Pashan Road, Pune, Maharashtra 411008 India.
MIT School of Bioengineering Sciences and Research, ADT (Art, Design and Technology) University, Loni Kalbhor, Pune, Maharashtra 412201 India.
J Cheminform. 2016 Dec 29;8:73. doi: 10.1186/s13321-016-0175-x. eCollection 2016.
Digital access to chemical journals has made a vast array of molecular information available in supplementary material files, typically in PDF format. However, extracting this molecular information from PDF documents is a daunting task. Here we present an approach to harvest 3D molecular data from the supporting information of scientific research articles, which is normally available from publishers' resources. To demonstrate the feasibility of extracting truly computable molecules from PDF files in a fast and efficient manner, we have developed a Java-based application named ChemEngine. This program recognizes textual patterns in the supplementary data and generates standard molecular structure data (bond matrix, atomic coordinates) that can be subjected automatically to a multitude of computational processes. The methodology has been demonstrated via several case studies on different formats of coordinate data stored in supplementary information files, wherein ChemEngine selectively harvested the atomic coordinates and interpreted them as molecules with high accuracy. The reusability of the extracted molecular coordinate data was demonstrated by computing single point energies that were in close agreement with the original computed data provided with the articles. It is envisaged that the methodology will enable large-scale conversion of molecular information from supplementary files available in PDF format into a collection of ready-to-compute molecular data, creating an automated workflow for advanced computational processes. The software, along with source code and instructions, is available at https://sourceforge.net/projects/chemengine/files/?source=navbar.
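The pattern-recognition step described in the abstract can be illustrated with a minimal Java sketch. This is not ChemEngine's actual code: the class name `CoordinateSketch`, the regular expression, the covalent radii table, the fallback radius, and the 0.4 Å tolerance are all assumptions chosen for illustration, and the input is assumed to be plain text already extracted from a PDF. The sketch picks out lines of the form "element x y z" from surrounding prose and derives a simple bond matrix from interatomic distances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoordinateSketch {

    /** One atom: element symbol plus Cartesian coordinates in angstroms. */
    public static final class Atom {
        public final String element;
        public final double x, y, z;

        public Atom(String element, double x, double y, double z) {
            this.element = element;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    // Assumed line layout: element symbol followed by three decimal numbers.
    // Real supplementary files use many other layouts (atomic numbers, indices, ...).
    private static final Pattern XYZ_LINE = Pattern.compile(
        "^[ \\t]*([A-Z][a-z]?)[ \\t]+(-?\\d+\\.\\d+)[ \\t]+(-?\\d+\\.\\d+)[ \\t]+(-?\\d+\\.\\d+)[ \\t]*$",
        Pattern.MULTILINE);

    // Approximate covalent radii in angstroms for a few elements (illustrative subset).
    private static final Map<String, Double> RADII =
        Map.of("H", 0.31, "C", 0.76, "N", 0.71, "O", 0.66);

    // Hypothetical fallback radius and bonding tolerance, both assumptions.
    private static final double DEFAULT_RADIUS = 0.77;
    private static final double TOLERANCE = 0.4;

    /** Collects every line that looks like a coordinate record and skips the rest. */
    public static List<Atom> parseAtoms(String text) {
        List<Atom> atoms = new ArrayList<>();
        Matcher m = XYZ_LINE.matcher(text);
        while (m.find()) {
            atoms.add(new Atom(m.group(1),
                Double.parseDouble(m.group(2)),
                Double.parseDouble(m.group(3)),
                Double.parseDouble(m.group(4))));
        }
        return atoms;
    }

    /** Marks two atoms as bonded when their distance is within the summed radii plus tolerance. */
    public static int[][] bondMatrix(List<Atom> atoms) {
        int n = atoms.size();
        int[][] bonds = new int[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                Atom a = atoms.get(i);
                Atom b = atoms.get(j);
                double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
                double distance = Math.sqrt(dx * dx + dy * dy + dz * dz);
                double cutoff = RADII.getOrDefault(a.element, DEFAULT_RADIUS)
                    + RADII.getOrDefault(b.element, DEFAULT_RADIUS) + TOLERANCE;
                if (distance <= cutoff) {
                    bonds[i][j] = 1;
                    bonds[j][i] = 1;
                }
            }
        }
        return bonds;
    }

    public static void main(String[] args) {
        // Made-up sample text: a water-like geometry surrounded by prose lines.
        String text = "Optimized geometry of the product\n"
            + "O 0.000 0.000 0.117\n"
            + "H 0.000 0.757 -0.467\n"
            + "H 0.000 -0.757 -0.467\n"
            + "End of coordinates\n";
        List<Atom> atoms = parseAtoms(text);
        System.out.println("Atoms found: " + atoms.size());
        for (int[] row : bondMatrix(atoms)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

A distance-based bond matrix of this kind records connectivity only; assigning bond orders, or handling coordinate tables that span page breaks or use atomic numbers instead of symbols, would need additional logic beyond this sketch.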