
ChemEngine: harvesting 3D chemical structures of supplementary data from PDF files.

Authors

Karthikeyan Muthukumarasamy, Vyas Renu

Affiliations

Chemical Engineering and Process Development (CEPD), CSIR-National Chemical Laboratory, Pashan Road, Pune, Maharashtra 411008 India.

MIT School of Bioengineering Sciences and Research, ADT (Art, Design and Technology) University, Loni Kalbhor, Pune, Maharashtra 412201 India.

Publication information

J Cheminform. 2016 Dec 29;8:73. doi: 10.1186/s13321-016-0175-x. eCollection 2016.

DOI:10.1186/s13321-016-0175-x
PMID:28090216
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5195924/
Abstract

Digital access to chemical journals has made a vast array of molecular information available in supplementary material files in PDF format. However, extracting this molecular information from PDF documents is a daunting task. Here we present an approach to harvest 3D molecular data from the supporting information of scientific research articles, which is normally available from publishers' resources. To demonstrate the feasibility of extracting truly computable molecules from PDF files in a fast and efficient manner, we have developed a Java-based application, ChemEngine. This program recognizes textual patterns in the supplementary data and generates standard molecular structure data (bond matrix, atomic coordinates) that can be subjected to a multitude of computational processes automatically. The methodology has been demonstrated via several case studies on different formats of coordinate data stored in supplementary information files, wherein ChemEngine selectively harvested the atomic coordinates and interpreted them as molecules with high accuracy. The reusability of the extracted molecular coordinate data was demonstrated by computing single point energies that were in close agreement with the original computed data provided with the articles. It is envisaged that the methodology will enable large-scale conversion of molecular information from supplementary files in PDF format into a collection of ready-to-compute molecular data, creating an automated workflow for advanced computational processes. The software, along with source code and instructions, is available at https://sourceforge.net/projects/chemengine/files/?source=navbar.
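
The pattern-recognition step described in the abstract can be illustrated with a minimal sketch. This is not ChemEngine's code (ChemEngine is a Java application whose internals are not given here); it is a Python illustration under stated assumptions: the PDF text has already been extracted to a plain string, each coordinate line has the form "element x y z", and bonds are inferred with a common distance heuristic (sum of covalent radii plus a 0.4 Å tolerance), which may differ from the method the authors used. All names (`harvest_atoms`, `bond_matrix`, `to_xyz`, `SAMPLE`) are hypothetical.

```python
import re
from itertools import combinations
from math import dist

# Assumed line format: element symbol followed by three decimal numbers,
# e.g. "O   0.000000   0.000000   0.117300".
COORD_LINE = re.compile(
    r"^\s*([A-Z][a-z]?)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)

# Approximate covalent radii in angstroms (small illustrative subset).
COVALENT_RADII = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}

# Hypothetical text as it might come out of a supplementary PDF.
SAMPLE = """Table S1. Cartesian coordinates (Å) of the optimized structure
O   0.000000   0.000000   0.117300
H   0.000000   0.757200  -0.469200
H   0.000000  -0.757200  -0.469200
Page 3
"""


def harvest_atoms(text):
    """Return (element, x, y, z) tuples for lines that look like coordinates."""
    atoms = []
    for line in text.splitlines():
        match = COORD_LINE.match(line)
        # Skip prose lines and symbols outside the known element subset.
        if match and match.group(1) in COVALENT_RADII:
            element, x, y, z = match.groups()
            atoms.append((element, float(x), float(y), float(z)))
    return atoms


def bond_matrix(atoms, tolerance=0.4):
    """Build a 0/1 adjacency matrix using a covalent-radius distance cutoff."""
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        (elem_i, *pos_i), (elem_j, *pos_j) = atoms[i], atoms[j]
        cutoff = COVALENT_RADII[elem_i] + COVALENT_RADII[elem_j] + tolerance
        if dist(pos_i, pos_j) <= cutoff:
            matrix[i][j] = matrix[j][i] = 1
    return matrix


def to_xyz(atoms, comment="harvested from supplementary text"):
    """Serialize atoms in the plain XYZ format (count, comment, one atom per line)."""
    lines = [str(len(atoms)), comment]
    lines += [f"{e} {x:.6f} {y:.6f} {z:.6f}" for e, x, y, z in atoms]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    atoms = harvest_atoms(SAMPLE)
    print(to_xyz(atoms))
    print(bond_matrix(atoms))
```

For the water example, the caption and page-number lines are ignored, three atoms are harvested, and the matrix marks the two O-H contacts as bonded while leaving the H-H pair unbonded. Real supplementary files use many more layouts (atomic numbers instead of symbols, index columns, multi-column tables), so a single regular expression like this one would need to be extended per format.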

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e3a/5195924/253a18745f48/13321_2016_175_Fig5_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e3a/5195924/17c6e9a93307/13321_2016_175_Figa_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e3a/5195924/31d4eb904ada/13321_2016_175_Fig2_HTML.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e3a/5195924/5a39d007630a/13321_2016_175_Fig4_HTML.jpg
