Automatic identification of relevant chemical compounds from patents.

Affiliations

Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands.

Elsevier B.V., Radarweg 29, Amsterdam, The Netherlands.

Publication information

Database (Oxford). 2019 Jan 1;2019:baz001. doi: 10.1093/database/baz001.

PMID:30698776

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6351730/

Abstract

In commercial research and development projects, public disclosure of new chemical compounds often takes place in patents. Only a small proportion of these compounds are published in journals, usually a few years after the patent. Patent authorities make available the patents but do not provide systematic continuous chemical annotations. Content databases such as Elsevier's Reaxys provide such services mostly based on manual excerptions, which are time-consuming and costly. Automatic text-mining approaches help overcome some of the limitations of the manual process. Different text-mining approaches exist to extract chemical entities from patents. The majority of them have been developed using sub-sections of patent documents and focus on mentions of compounds. Less attention has been given to relevancy of a compound in a patent. Relevancy of a compound to a patent is based on the patent's context. A relevant compound plays a major role within a patent. Identification of relevant compounds reduces the size of the extracted data and improves the usefulness of patent resources (e.g. supports identifying the main compounds). Annotators of databases like Reaxys only annotate relevant compounds. In this study, we design an automated system that extracts chemical entities from patents and classifies their relevance. The gold-standard set contained 18 789 chemical entity annotations. Of these, 10% were relevant compounds, 88% were irrelevant and 2% were equivocal. Our compound recognition system was based on proprietary tools. The performance (F-score) of the system on compound recognition was 84% on the development set and 86% on the test set. The relevancy classification system had an F-score of 86% on the development set and 82% on the test set. Our system can extract chemical compounds from patents and classify their relevance with high performance. This enables the extension of the Reaxys database by means of automation.
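
The F-scores reported above combine precision and recall into a single number. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `f_score` and all counts are hypothetical and chosen only to illustrate the arithmetic. The second call shows why an F-score is informative when only about 10% of annotations are relevant: a classifier that never predicts "relevant" scores zero on that class, even though it would be right most of the time.

```python
def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    # Weighted harmonic mean of precision and recall (beta=1 gives F1).
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, not taken from the paper.
print(round(f_score(tp=80, fp=20, fn=10), 3))  # 0.842

# A classifier that never predicts the relevant class finds none of them.
print(f_score(tp=0, fp=0, fn=10))  # 0.0
```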

Similar articles

Annotated chemical patent corpus: a gold standard for text mining.

PLoS One. 2014 Sep 30;9(9):e107477. doi: 10.1371/journal.pone.0107477. eCollection 2014.

Managing expectations: assessment of chemistry databases generated by automated extraction of chemical structures from patents.

J Cheminform. 2015 Oct 6;7(1):49. doi: 10.1186/s13321-015-0097-z. eCollection 2015 Dec.

Chemical named entity recognition in patents by domain knowledge and unsupervised feature learning.

Database (Oxford). 2016 Apr 17;2016. doi: 10.1093/database/baw049. Print 2016.

Chemical entity recognition in patents by combining dictionary-based and statistical approaches.

Database (Oxford). 2016 May 2;2016. doi: 10.1093/database/baw061. Print 2016.

SureChEMBL: a large-scale, chemically annotated patent document database.

Nucleic Acids Res. 2016 Jan 4;44(D1):D1220-8. doi: 10.1093/nar/gkv1253. Epub 2015 Nov 17.

Ontology-based content analysis of US patent applications from 2001-2010.

Pharm Pat Anal. 2013 Jan;2(1):39-54. doi: 10.4155/ppa.12.76.

Folic acid supplementation and malaria susceptibility and severity among people taking antifolate antimalarial drugs in endemic areas.

Cochrane Database Syst Rev. 2022 Feb 1;2(2022):CD014217. doi: 10.1002/14651858.CD014217.

Development of an information retrieval tool for biomedical patents.

Comput Methods Programs Biomed. 2018 Jun;159:125-134. doi: 10.1016/j.cmpb.2018.03.012. Epub 2018 Mar 14.

Mining chemical patents with an ensemble of open systems.

Database (Oxford). 2016 May 12;2016. doi: 10.1093/database/baw065. Print 2016.

Cited by

PatCID: an open-access dataset of chemical structures in patent documents.

Nat Commun. 2024 Aug 2;15(1):6532. doi: 10.1038/s41467-024-50779-y.

OSPAR: A Corpus for Extraction of Organic Synthesis Procedures with Argument Roles.

J Chem Inf Model. 2023 Nov 13;63(21):6619-6628. doi: 10.1021/acs.jcim.3c01449. Epub 2023 Oct 19.

Deep learning-based automatic action extraction from structured chemical synthesis procedures.

PeerJ Comput Sci. 2023 Aug 18;9:e1511. doi: 10.7717/peerj-cs.1511. eCollection 2023.

PubChem 2023 update.

Nucleic Acids Res. 2023 Jan 6;51(D1):D1373-D1380. doi: 10.1093/nar/gkac956.

Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space.

Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac461.

Old drugs, new tricks: leveraging known compounds to disrupt coronavirus-induced cytokine storm.

NPJ Syst Biol Appl. 2022 Oct 10;8(1):38. doi: 10.1038/s41540-022-00250-9.

From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents.

Front Res Metr Anal. 2021 Dec 22;6:691105. doi: 10.3389/frma.2021.691105. eCollection 2021.

ChemTables: a dataset for semantic classification on tables in chemical patents.

J Cheminform. 2021 Dec 11;13(1):97. doi: 10.1186/s13321-021-00568-2.

Congenericity of Claimed Compounds in Patent Applications.

Molecules. 2021 Aug 30;26(17):5253. doi: 10.3390/molecules26175253.

Statistics of the Popularity of Chemical Compounds in Relation to the Non-Target Analysis.

Molecules. 2021 Apr 20;26(8):2394. doi: 10.3390/molecules26082394.
