PatCID: an open-access dataset of chemical structures in patent documents.

Authors

Morin Lucas, Weber Valéry, Meijer Gerhard Ingmar, Yu Fisher, Staar Peter W J

Affiliations

IBM Research, Säumerstrasse 4, 8803, Rüschlikon, Switzerland.

Department of Information Technology and Electrical Engineering, ETH Zürich, Sternwartstrasse 7, 8092, Zürich, Switzerland.

Publication information

Nat Commun. 2024 Aug 2;15(1):6532. doi: 10.1038/s41467-024-50779-y.

DOI:10.1038/s41467-024-50779-y
PMID:39095357
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11297020/
Abstract

The automatic analysis of patent publications has the potential to accelerate research across various domains, including drug discovery and materials science. Within patent documents, crucial information often resides in visual depictions of molecular structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) provides access to such information at scale: it lets users search which molecules are displayed in which documents. PatCID contains 81M chemical-structure images and 14M unique chemical structures. Here, we compare PatCID with state-of-the-art chemical patent databases. On a random set, PatCID retrieves 56.0% of molecules, which is higher than the automatically created databases Google Patents (41.5%) and SureChEMBL (23.5%), as well as the manually created databases Reaxys (53.5%) and SciFinder (49.5%). By leveraging state-of-the-art document-understanding methods, PatCID's high-quality data outperforms currently available automatically generated patent databases and even competes with proprietary, manually created ones. This enables promising applications in automatic literature review and learning-based molecular generation. The dataset is freely available for download.
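The retrieval percentages quoted in the abstract can be read as a recall-style measure: the share of molecules in a reference set that a database also returns for the same document. The abstract does not describe the exact evaluation protocol, so the following is only a minimal sketch under stated assumptions. The function name `retrieval_rate`, the representation of each molecule as a (document ID, molecule ID) pair, and all of the data are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Set, Tuple

# A (document id, molecule id) pair. Assumption: molecules are compared by a
# canonical identifier; the toy ids below are made up for illustration.
Pair = Tuple[str, str]


def retrieval_rate(ground_truth: Set[Pair], database: Set[Pair]) -> float:
    """Fraction of ground-truth (document, molecule) pairs present in the database."""
    if not ground_truth:
        raise ValueError("ground truth set is empty")
    return len(ground_truth & database) / len(ground_truth)


# Hypothetical reference set: molecules known to be displayed in each document.
ground_truth: Set[Pair] = {
    ("US-1", "mol-A"),
    ("US-1", "mol-B"),
    ("EP-2", "mol-C"),
    ("EP-2", "mol-D"),
}

# Hypothetical database contents (not real PatCID or competitor data).
databases: Dict[str, Set[Pair]] = {
    "db_x": {("US-1", "mol-A"), ("EP-2", "mol-C"), ("EP-2", "mol-D"), ("JP-3", "mol-Z")},
    "db_y": {("US-1", "mol-B")},
}

rates = {name: retrieval_rate(ground_truth, pairs) for name, pairs in databases.items()}

# Print databases from highest to lowest retrieval rate.
for name, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {rate:.1%}")
```

Extra entries in a database that are absent from the reference set (such as `("JP-3", "mol-Z")` here) do not affect this measure, which is why it reflects coverage only and says nothing about precision.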

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51e3/11297020/89a3703c96c7/41467_2024_50779_Fig1_HTML.jpg
Figure 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51e3/11297020/dc30b7d5a545/41467_2024_50779_Fig2_HTML.jpg
Figure 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51e3/11297020/445bcc5d528a/41467_2024_50779_Fig3_HTML.jpg
Figure 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/51e3/11297020/61eae8d0d1ac/41467_2024_50779_Fig4_HTML.jpg

Author Correction: PatCID: an open-access dataset of chemical structures in patent documents.
Nat Commun. 2025 Jan 2;16(1):224. doi: 10.1038/s41467-024-55566-3.