Zhang Min, Geng Guohua, Chen Jing
School of Information Science and Technology, Northwest University, Xi'an 710127, China.
School of Engineering and Technology, Xi'an Fanyi University, Xi'an 710105, China.
Entropy (Basel). 2020 Feb 22;22(2):252. doi: 10.3390/e22020252.
Increasingly popular online museums have significantly changed the way people acquire cultural knowledge, and they generate large amounts of cultural relics data. In recent years, researchers have applied deep learning models, which automatically extract complex features and have rich representation capabilities, to named-entity recognition (NER). However, labeled data are scarce in the field of cultural relics, which makes it difficult for deep learning models that rely on labeled data to perform well. To address this problem, this paper proposes a semi-supervised deep learning model named SCRNER (Semi-supervised model for Cultural Relics' Named Entity Recognition), which uses a bidirectional long short-term memory (BiLSTM) and conditional random fields (CRF) model trained on a small amount of labeled data and abundant unlabeled data. For semi-supervised sample selection, we propose a repeat-labeled (relabeled) strategy that selects high-confidence samples to enlarge the training set iteratively. In addition, we use embeddings from language models (ELMo) to dynamically obtain word representations as the input of the model, which addresses the blurred boundaries of cultural relic entities and the characteristics of Chinese text in this field. Experimental results demonstrate that the proposed model, trained on limited labeled data, performs effectively on the task of named entity recognition for cultural relics.
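The iterative enlargement of the training set described in the abstract can be illustrated with a generic self-training loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `self_train`, `toy_trainer`, `threshold`, and `max_rounds` are hypothetical, the confidence threshold of 0.9 is arbitrary, and a toy lexicon tagger stands in for the BiLSTM-CRF model with ELMo inputs. The specific repeat-labeled (relabeled) selection criterion is not detailed in the abstract, so plain confidence thresholding is used in its place.

```python
from typing import Callable, Dict, List, Tuple

Sentence = List[str]
Tagged = Tuple[Sentence, List[str]]
# A predictor returns (tags, confidence in [0, 1]) for one sentence.
Predictor = Callable[[Sentence], Tuple[List[str], float]]
# A trainer builds a predictor from labeled sentences.
Trainer = Callable[[List[Tagged]], Predictor]


def self_train(
    train_fn: Trainer,
    labeled: List[Tagged],
    unlabeled: List[Sentence],
    threshold: float = 0.9,  # hypothetical cut-off, not from the paper
    max_rounds: int = 5,     # hypothetical iteration cap
) -> Tuple[Predictor, List[Tagged]]:
    """Generic self-training: pseudo-label, keep confident samples, retrain."""
    train_set = list(labeled)
    pool = list(unlabeled)
    predict = train_fn(train_set)
    for _ in range(max_rounds):
        if not pool:
            break
        confident: List[Tagged] = []
        remaining: List[Sentence] = []
        for sent in pool:
            tags, conf = predict(sent)
            if conf >= threshold:
                confident.append((sent, tags))
            else:
                remaining.append(sent)
        if not confident:
            break  # nothing new to learn from; stop early
        train_set.extend(confident)
        pool = remaining
        predict = train_fn(train_set)  # retrain on the enlarged set
    return predict, train_set


def toy_trainer(data: List[Tagged]) -> Predictor:
    """Stand-in model (assumption): memorizes one tag per token."""
    lexicon: Dict[str, str] = {}
    for sent, tags in data:
        for tok, tag in zip(sent, tags):
            lexicon[tok] = tag

    def predict(sent: Sentence) -> Tuple[List[str], float]:
        tags = [lexicon.get(tok, "O") for tok in sent]
        if not sent:
            return tags, 0.0
        # Confidence here is simply the share of tokens seen in training.
        known = sum(1 for tok in sent if tok in lexicon)
        return tags, known / len(sent)

    return predict


# Illustrative data only; the tag names are invented for this sketch.
labeled_data = [(["bronze", "ding", "from", "Shang"],
                 ["B-MAT", "B-OBJ", "O", "B-DYN"])]
unlabeled_data = [["bronze", "ding"], ["jade", "bi", "from", "Han"]]
model, enlarged = self_train(toy_trainer, labeled_data, unlabeled_data)
```

In a real system, `train_fn` would fit the sequence tagger and the confidence would come from the model itself, for example a normalized sequence probability from the CRF layer; how SCRNER computes and applies confidence is described in the full paper rather than in this abstract.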