Zhang Min, Geng Guohua, Zeng Sheng, Jia Huaping
School of Information Science and Technology, Northwest University, Xi'an 710127, China.
College of Computer, Weinan Normal University, Weinan 714099, China.
Entropy (Basel). 2020 Oct 16;22(10):1168. doi: 10.3390/e22101168.
Knowledge graph completion makes knowledge graphs more complete and is therefore a meaningful research topic. However, existing methods do not make full use of entity semantic information. A further challenge is that deep models require large-scale manually labelled data, which greatly increases manual labour. To alleviate the scarcity of labelled data in the field of cultural relics and to capture the rich semantic information of entities, this paper proposes a model based on Bidirectional Encoder Representations from Transformers (BERT) with entity-type information for knowledge graph completion over Chinese texts of cultural relics. In this work, the knowledge graph completion task is treated as a classification task: the entities, relations and entity-type information are integrated into a textual sequence, Chinese characters are used as the token unit, and the input representation is constructed by summing the token, segment and position embeddings. The model is pre-trained on a large amount of unlabelled data and then fine-tuned with a small amount of labelled data. The experimental results show that the BERT-KGC model with entity-type information enriches the semantic information of the entities, reduces the ambiguity of entities and relations to some degree, and outperforms the baselines on the triple classification, link prediction and relation prediction tasks while using only 35% of the labelled data of cultural relics.
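The abstract describes packing a (head, relation, tail) triple together with entity-type information into a single character-level token sequence, alongside segment and position ids whose embeddings are summed in the usual BERT fashion. The sketch below illustrates that input construction only; the function name `pack_triple`, the placement of the type strings next to their entities, and the example triple are illustrative assumptions, not the paper's exact scheme.

```python
# Illustrative sketch of building a BERT-style input for one knowledge-graph
# triple with entity-type information, using Chinese characters as tokens.
# (Assumed layout: [CLS] head+type [SEP] relation [SEP] tail+type [SEP].)

CLS, SEP = "[CLS]", "[SEP]"

def pack_triple(head, head_type, relation, tail, tail_type):
    """Return character-level tokens, segment ids and position ids."""
    segments = [
        list(head) + list(head_type),  # head entity followed by its type
        list(relation),                # relation
        list(tail) + list(tail_type),  # tail entity followed by its type
    ]
    tokens, seg_ids = [CLS], [0]
    for i, seg in enumerate(segments):
        tokens += seg + [SEP]
        # BERT only distinguishes segments 0 and 1, so later segments fold to 1.
        seg_ids += [min(i, 1)] * (len(seg) + 1)
    pos_ids = list(range(len(tokens)))  # position embedding indices
    return tokens, seg_ids, pos_ids

# Hypothetical cultural-relics triple: (唐三彩, 出土于, 洛阳) with types 陶器/地名.
tokens, seg_ids, pos_ids = pack_triple("唐三彩", "陶器", "出土于", "洛阳", "地名")
```

In a full model, each of the three id sequences would index its own embedding table and the three embedding vectors would be summed per position to form the input representation fed to the Transformer encoder.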