An ontology-based text mining dataset for extraction of process-structure-property entities.

Authors

Durmaz Ali Riza, Thomas Akhil, Mishra Lokesh, Murthy Rachana Niranjan, Straub Thomas

Affiliations

Fraunhofer Institute for Mechanics of Materials IWM, Freiburg im Breisgau, 79108, Germany.

University of Freiburg, Freiburg, 79098, Germany.

Publication information

Sci Data. 2024 Oct 10;11(1):1112. doi: 10.1038/s41597-024-03926-5.

DOI:10.1038/s41597-024-03926-5
PMID:39389990
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11467320/

Abstract

While large language models learn sound statistical representations of the language and information therein, ontologies are symbolic knowledge representations that can complement the former ideally. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology where ontological concepts from the mechanics of materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its eminently fine-grained annotation. Specifically, 179 distinct classes are manually annotated by three raters within four publications, amounting to 2191 entities that were annotated and curated. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and perform fine-tuning of pre-trained language models to showcase the feasibility of training named entity recognition models. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
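The abstract mentions exploring annotation consistency between the three raters but does not name the metric used. As a minimal sketch, assuming token-level labels and pairwise Cohen's kappa as the agreement measure (both are assumptions, not details confirmed by the abstract), the comparison could look like this. The rater names, class labels, and token sequences below are hypothetical illustration data, not taken from MaterioMiner.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same sequence of tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        count_a[label] * count_b[label] for label in count_a.keys() | count_b.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def pairwise_kappa(annotations):
    """Kappa for every pair of raters in a {rater_name: labels} mapping."""
    return {
        (r1, r2): cohen_kappa(annotations[r1], annotations[r2])
        for r1, r2 in combinations(sorted(annotations), 2)
    }


# Hypothetical token-level labels from three raters (illustrative only).
example = {
    "rater_1": ["O", "Process", "Process", "O", "Property", "O", "Microstructure", "O"],
    "rater_2": ["O", "Process", "O", "O", "Property", "O", "Microstructure", "O"],
    "rater_3": ["O", "Process", "Process", "O", "Property", "O", "Property", "O"],
}
scores = pairwise_kappa(example)
for pair, kappa in scores.items():
    print(pair, round(kappa, 3))
```

With 179 fine-grained classes, many labels are rare, so chance-corrected measures like this can be sensitive to class imbalance; span-level F1 between raters is a common alternative.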

Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/415e/11467320/25af8b7b0809/41597_2024_3926_Fig1_HTML.jpg
Fig. 2: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/415e/11467320/21772eef9cef/41597_2024_3926_Fig2_HTML.jpg
Fig. 3: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/415e/11467320/a659b0c91e4e/41597_2024_3926_Fig3_HTML.jpg
Fig. 4: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/415e/11467320/b875bf249ea3/41597_2024_3926_Fig4_HTML.jpg
Fig. 5: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/415e/11467320/a45cc918a23b/41597_2024_3926_Fig5_HTML.jpg
Fig. 6: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/415e/11467320/164a74f55321/41597_2024_3926_Fig6_HTML.jpg
Fig. 7: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/415e/11467320/b44c5a6bd4cb/41597_2024_3926_Fig7_HTML.jpg
