Construction of chemical information database based on optical structure recognition technique

OBJECTIVE

To create a protocol for constructing a chemical information database from scientific literature quickly and automatically.

METHODS

Scientific literature, patents, and technical reports from different chemical disciplines were collected and stored in PDF format as the fundamental datasets. Chemical structures in published documents and images were converted to machine-readable data using name conversion technology and the optical structure recognition tool CLiDE. During molecular structure extraction, Markush structures were enumerated into well-defined individual molecules by means of QueryTools in the molecule editor ChemDraw. The document management software EndNote X8 was used to acquire bibliographic references, including title, author, journal, and year of publication. The text-mining toolkit ChemDataExtractor was used to retrieve information from figures, tables, and text paragraphs for populating the structured chemical database. Detailed manual revision and annotation were then conducted to ensure the accuracy and completeness of the data. In addition to the literature data, the computational simulation platform Pipeline Pilot 7.5 was used to calculate physicochemical properties and predict molecular attributes. Furthermore, the open database ChEMBL was linked to fetch known bioactivities, such as indications and targets. After information extraction and data expansion, five separate metadata files were generated: the molecular structure data file, molecular information, bibliographic references, predicted attributes, and known bioactivities. With the canonical simplified molecular-input line-entry system (SMILES) string as the primary key, the metadata files were associated through common key nodes, including the molecule number and the PDF number, to construct an integrated chemical information database.
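The final linking step can be illustrated with a minimal sketch. It assumes the five metadata files are loaded as tables in an in-memory SQLite database, which the abstract does not specify. All table names, column names, and the helper functions `build_database`, `load_example_rows`, and `fetch_integrated` are hypothetical, and the rows are made-up placeholders rather than PKU-MNPD data.

```python
import sqlite3


def build_database(conn):
    """Create five tables mirroring the five metadata files (hypothetical schema)."""
    conn.executescript(
        """
        CREATE TABLE molecule_info (
            canonical_smiles TEXT PRIMARY KEY,   -- primary key, as in the abstract
            mol_no           TEXT UNIQUE NOT NULL,  -- molecule number (common key)
            pdf_no           TEXT NOT NULL          -- PDF number (common key)
        );
        CREATE TABLE structure_files (mol_no TEXT PRIMARY KEY, molblock TEXT);
        CREATE TABLE bibliography (
            pdf_no TEXT PRIMARY KEY, title TEXT, journal TEXT, year INTEGER
        );
        CREATE TABLE predicted_attributes (mol_no TEXT, attribute TEXT, value REAL);
        CREATE TABLE known_bioactivities (mol_no TEXT, target TEXT, indication TEXT);
        """
    )


def load_example_rows(conn):
    """Insert placeholder rows; none of these values come from the paper."""
    conn.execute("INSERT INTO molecule_info VALUES ('CCO', 'MOL-0001', 'PDF-001')")
    conn.execute("INSERT INTO structure_files VALUES ('MOL-0001', '<molblock placeholder>')")
    conn.execute(
        "INSERT INTO bibliography VALUES ('PDF-001', 'Example title', 'Example journal', 2015)"
    )
    conn.execute(
        "INSERT INTO predicted_attributes VALUES ('MOL-0001', 'heavy_atom_count', 3.0)"
    )
    conn.execute(
        "INSERT INTO known_bioactivities VALUES ('MOL-0001', 'example target', 'example indication')"
    )


def fetch_integrated(conn, canonical_smiles):
    """Assemble one integrated record by following the molecule and PDF numbers."""
    row = conn.execute(
        """
        SELECT m.canonical_smiles, m.mol_no, m.pdf_no, b.title, b.year
        FROM molecule_info AS m
        JOIN bibliography AS b ON b.pdf_no = m.pdf_no
        WHERE m.canonical_smiles = ?
        """,
        (canonical_smiles,),
    ).fetchone()
    if row is None:
        return None
    record = dict(zip(("canonical_smiles", "mol_no", "pdf_no", "title", "year"), row))
    # Per-molecule tables are joined through the molecule number.
    record["predicted"] = dict(
        conn.execute(
            "SELECT attribute, value FROM predicted_attributes WHERE mol_no = ?",
            (record["mol_no"],),
        ).fetchall()
    )
    record["bioactivities"] = conn.execute(
        "SELECT target, indication FROM known_bioactivities WHERE mol_no = ?",
        (record["mol_no"],),
    ).fetchall()
    return record


conn = sqlite3.connect(":memory:")
build_database(conn)
load_example_rows(conn)
record = fetch_integrated(conn, "CCO")
```

In this sketch the canonical SMILES string identifies each molecule uniquely, while the molecule number and PDF number act as the shared keys that tie the per-molecule and per-document tables together.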

RESULTS

A reasonable protocol for constructing a chemical information database was established. A total of 174 research articles and 25 reviews published in Marine Drugs from January 2015 to June 2016 were collected as the essential data source, and an elementary marine natural product database named PKU-MNPD, containing 3 262 molecules and 19 821 records, was built in accordance with this protocol.

CONCLUSION

This data aggregation protocol helps improve the accuracy, comprehensiveness, and efficiency of constructing chemical information databases from original documents. The resulting structured database can facilitate access to medical intelligence and accelerate the translation of scientific research achievements.

