[Construction of chemical information database based on optical structure recognition technique].

Authors

Lv C Y, Li M N, Zhang L R, Liu Z M

Affiliation

State Key Laboratory of Natural and Biomimetic Drugs, Peking University School of Pharmaceutical Sciences, Beijing 100191, China.

Publication

Beijing Da Xue Xue Bao Yi Xue Ban. 2018 Apr 18;50(2):352-357.

PMID:29643539
Abstract

OBJECTIVE

To create a protocol for constructing a chemical information database from scientific literature quickly and automatically.

METHODS

Scientific literature, patents, and technical reports from different chemical disciplines were collected and stored in PDF format as the fundamental datasets. Chemical structures in published documents and images were converted to machine-readable data using name conversion technology and the optical structure recognition tool CLiDE. During molecular structure extraction, Markush structures were enumerated into well-defined individual molecules using QueryTools in the molecule editor ChemDraw. The document management software EndNote X8 was used to acquire bibliographic references, including title, author, journal, and year of publication. The text mining toolkit ChemDataExtractor was used to retrieve information from figures, tables, and textual paragraphs for populating the structured chemical database. Detailed manual revision and annotation followed to ensure the accuracy and completeness of the data. In addition to the literature data, the computing simulation platform Pipeline Pilot 7.5 was used to calculate physical and chemical properties and to predict molecular attributes. Furthermore, the open database ChEMBL was linked to fetch known bioactivities, such as indications and targets. After information extraction and data expansion, five separate metadata files were generated: the molecular structure data file, molecular information, bibliographic references, predicted attributes, and known bioactivities. With the canonical simplified molecular input line entry specification (SMILES) as the primary key, the metadata files were associated through common key nodes, including molecule number and PDF number, to construct an integrated chemical information database.
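The final linking step can be illustrated with a minimal sketch in Python using the standard-library `sqlite3` module. The abstract does not describe the actual schema of the database, so every table name, column name, and example row below is a hypothetical placeholder. The sketch only shows the idea of keying five metadata tables on canonical SMILES and joining them through a molecule number and a PDF number. It also assumes the SMILES strings have already been canonicalized by a cheminformatics toolkit.

```python
import sqlite3

# Hypothetical schema: one table per metadata file named in the abstract.
# Canonical SMILES is the primary key of the structure table; molecule_info
# links each molecule number to its SMILES and to the PDF it came from.
SCHEMA = """
CREATE TABLE structure_file (smiles TEXT PRIMARY KEY, molfile TEXT);
CREATE TABLE bibliography (
    pdf_no  INTEGER PRIMARY KEY,
    title   TEXT,
    journal TEXT,
    year    INTEGER
);
CREATE TABLE molecule_info (
    mol_no INTEGER PRIMARY KEY,
    smiles TEXT NOT NULL REFERENCES structure_file(smiles),
    pdf_no INTEGER NOT NULL REFERENCES bibliography(pdf_no),
    name   TEXT
);
CREATE TABLE predicted_attribute (
    smiles     TEXT PRIMARY KEY REFERENCES structure_file(smiles),
    mol_weight REAL
);
CREATE TABLE known_bioactivity (
    smiles TEXT NOT NULL REFERENCES structure_file(smiles),
    target TEXT
);
"""


def build_schema(conn):
    """Create the five linked tables and turn on foreign-key enforcement."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


def load_example(conn):
    """Insert one placeholder molecule (ethanol); none of this is real study data."""
    conn.execute("INSERT INTO structure_file VALUES (?, ?)", ("CCO", None))
    conn.execute(
        "INSERT INTO bibliography VALUES (?, ?, ?, ?)",
        (1, "Placeholder article title", "Placeholder journal", 2016),
    )
    conn.execute("INSERT INTO molecule_info VALUES (?, ?, ?, ?)", (1, "CCO", 1, "ethanol"))
    conn.execute("INSERT INTO predicted_attribute VALUES (?, ?)", ("CCO", 46.07))
    conn.execute("INSERT INTO known_bioactivity VALUES (?, ?)", ("CCO", "placeholder target"))
    conn.commit()


def integrated_records(conn, smiles):
    """Join all five tables for one canonical SMILES string."""
    query = """
        SELECT m.mol_no, b.title, p.mol_weight, k.target
        FROM structure_file AS s
        JOIN molecule_info AS m ON m.smiles = s.smiles
        JOIN bibliography AS b ON b.pdf_no = m.pdf_no
        LEFT JOIN predicted_attribute AS p ON p.smiles = s.smiles
        LEFT JOIN known_bioactivity AS k ON k.smiles = s.smiles
        WHERE s.smiles = ?
    """
    return conn.execute(query, (smiles,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_schema(conn)
    load_example(conn)
    for row in integrated_records(conn, "CCO"):
        print(row)
```

Keying every table on the canonical SMILES string means that a molecule reported in several PDFs maps to a single structure row, while `molecule_info` keeps one row per occurrence. That is one plausible way to get more records than molecules, as in the counts reported in the Results.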

RESULTS

A reasonable protocol for constructing a chemical information database was successfully created. A total of 174 research articles and 25 reviews published in Marine Drugs from January 2015 to June 2016 were collected as the essential data source, and an elementary marine natural product database named PKU-MNPD was built in accordance with this protocol. It contains 3,262 molecules and 19,821 records.

CONCLUSION

This data aggregation protocol improves the accuracy, comprehensiveness, and efficiency of constructing chemical information databases from original documents. A structured chemical information database can facilitate access to medical intelligence and accelerate the translation of scientific research achievements.
