ChemTables: a dataset for semantic classification on tables in chemical patents.

Authors

Zhai Zenan, Druckenbrodt Christian, Thorne Camilo, Akhondi Saber A, Nguyen Dat Quoc, Cohn Trevor, Verspoor Karin

Affiliations

School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia.

Elsevier-Data Science, Life Science, Amsterdam, The Netherlands.

Publication information

J Cheminform. 2021 Dec 11;13(1):97. doi: 10.1186/s13321-021-00568-2.

DOI:10.1186/s13321-021-00568-2
PMID:34895295
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8665561/
Abstract

Chemical patents are a commonly used channel for disclosing novel compounds and reactions, and hence represent important resources for chemical and pharmaceutical research. Key chemical data in patents is often presented in tables. Both the number and the size of tables can be very large in patent documents. In addition, various types of information can be presented in tables in patents, including spectroscopic and physical data, or pharmacological use and effects of chemicals. Since images of Markush structures and merged cells are commonly used in these tables, their structure also shows substantial variation. This heterogeneity in content and structure of tables in chemical patents makes relevant information difficult to find. We therefore propose a new text mining task of automatically categorising tables in chemical patents based on their contents. Categorisation of tables based on the nature of their content can help to identify tables containing key information, improving the accessibility of information in patents that is highly relevant for new inventions. For developing and evaluating methods for the table classification task, we developed a new dataset, called CHEMTABLES, which consists of 788 chemical patent tables with labels of their content type. We introduce this dataset in detail. We further establish strong baselines for the table classification task in chemical patents by applying state-of-the-art neural network models developed for natural language processing, including TabNet, ResNet and Table-BERT on CHEMTABLES. The best performing model, Table-BERT, achieves a performance of 88.66 micro-averaged F1 score on the table classification task. The CHEMTABLES dataset is publicly available at https://doi.org/10.17632/g7tjh7tbrj.3, subject to the CC BY NC 3.0 license. Code/models evaluated in this work are in a GitHub repository: https://github.com/zenanz/ChemTables.
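
The headline result above is a micro-averaged F1 score. As a rough illustration of what that metric measures, the following minimal sketch pools true positives, false positives and false negatives across all classes before computing F1. It is not the authors' evaluation code, and the label names and predictions are hypothetical, chosen only to echo the content types mentioned in the abstract.

```python
from collections import Counter


def micro_f1(gold, predicted):
    """Micro-averaged F1 for single-label, multi-class predictions.

    Counts are pooled over all classes before F1 is computed, so frequent
    classes contribute more to the score than rare ones.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    denom = 2 * total_tp + total_fp + total_fn
    return 2 * total_tp / denom if denom else 0.0


# Hypothetical table labels (assumption: not the dataset's actual label set).
gold = ["spectroscopic", "physical", "pharmacological", "physical", "spectroscopic"]
pred = ["spectroscopic", "physical", "physical", "physical", "pharmacological"]

print(f"micro-F1 = {micro_f1(gold, pred):.2f}")
```

When every table receives exactly one label, as assumed in this sketch, each error counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with plain accuracy.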


