Durmaz Ali Riza, Thomas Akhil, Mishra Lokesh, Murthy Rachana Niranjan, Straub Thomas
Fraunhofer Institute for Mechanics of Materials IWM, Freiburg im Breisgau, 79108, Germany.
University of Freiburg, Freiburg, 79098, Germany.
Sci Data. 2024 Oct 10;11(1):1112. doi: 10.1038/s41597-024-03926-5.
While large language models learn sound statistical representations of language and the information it carries, ontologies are symbolic knowledge representations that complement them well. Research at this intersection relies on datasets that link ontologies with text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology, in which ontological concepts from the mechanics of materials domain are associated with textual entities in the literature corpus. A further distinctive feature of the dataset is its fine-grained annotation: three raters manually annotated 179 distinct classes across four publications, yielding 2191 annotated and curated entities. Conceptual work is presented on the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and fine-tune pre-trained language models to demonstrate the feasibility of training named entity recognition models. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
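As an illustration of how annotation consistency between three raters might be quantified, the following minimal sketch computes pairwise Cohen's kappa over token-level labels. The choice of metric is an assumption, since the abstract does not state which agreement measure the authors used. The function name `cohen_kappa`, the rater names, and the toy label sequences are hypothetical and are not taken from the MaterioMiner dataset.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences.

    Assumption: agreement is measured per token; the source does not
    specify the metric or the unit of comparison.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of positions where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy annotations ("O" marks a token outside any entity).
annotations = {
    "rater_1": ["Material", "O", "Process", "O", "Property", "O"],
    "rater_2": ["Material", "O", "Process", "Property", "Property", "O"],
    "rater_3": ["Material", "O", "O", "O", "Property", "O"],
}

# One kappa value per unordered pair of raters.
pairwise = {
    (r1, r2): cohen_kappa(annotations[r1], annotations[r2])
    for r1, r2 in combinations(sorted(annotations), 2)
}

for pair, kappa in pairwise.items():
    print(pair, round(kappa, 3))
```

With 179 classes, many labels would be rare, so per-class agreement or a multi-rater statistic may be more informative than a single pairwise score; the sketch only shows the basic calculation.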