PatCID: an open-access dataset of chemical structures in patent documents.

Author information

Morin Lucas, Weber Valéry, Meijer Gerhard Ingmar, Yu Fisher, Staar Peter W J

Affiliations

IBM Research, Säumerstrasse 4, 8803, Rüschlikon, Switzerland.

Department of Information Technology and Electrical Engineering, ETH Zürich, Sternwartstrasse 7, 8092, Zürich, Switzerland.

Publication information

Nat Commun. 2024 Aug 2;15(1):6532. doi: 10.1038/s41467-024-50779-y.

DOI:10.1038/s41467-024-50779-y

PMID:39095357

Original article:https://pmc.ncbi.nlm.nih.gov/articles/PMC11297020/

Abstract

The automatic analysis of patent publications has the potential to accelerate research across various domains, including drug discovery and materials science. Within patent documents, crucial information often resides in visual depictions of molecular structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) provides access to such information at scale. It enables users to search for which molecules are displayed in which documents. PatCID contains 81M chemical-structure images and 14M unique chemical structures. Here, we compare PatCID with state-of-the-art chemical patent databases. On a random set, PatCID retrieves 56.0% of molecules, which is higher than the automatically created databases Google Patents (41.5%) and SureChEMBL (23.5%), as well as the manually created databases Reaxys (53.5%) and SciFinder (49.5%). Leveraging state-of-the-art document-understanding methods, PatCID's high-quality data outperforms currently available automatically generated patent databases. PatCID even competes with proprietary, manually created patent databases. This enables promising applications for automatic literature review and learning-based molecular generation methods. The dataset is freely accessible for download.