Nguyen Vu H, Nguyen Hien T, Snasel Vaclav
Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.
Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, Czech Republic.
Comput Soc Netw. 2016;3(1):10. doi: 10.1186/s40649-016-0032-0. Epub 2016 Dec 1.
Named entity recognition (NER) is the task of detecting named entities in documents and classifying them into predefined categories such as person, location, and organization. This paper focuses on tweets posted on Twitter. Because tweets are noisy, irregular, and brief, and contain acronyms and spelling errors, NER on tweets is challenging. Many approaches have been proposed for tweets written in English, German, Chinese, and other languages, but none for Vietnamese tweets.
We propose a method that normalizes a tweet before passing it as input to a learning model for NER in Vietnamese tweets. The normalization step detects spelling errors in a tweet and corrects them using an improved Dice's coefficient or n-grams. A support vector machine (SVM) learning algorithm is then used to train a classifier on six different types of features.
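To make the normalization idea concrete, the following is a minimal sketch of spelling correction with the standard Dice coefficient over character bigrams. It is not the improved coefficient used in this paper, whose details the abstract does not give. The function names, the lexicon, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def bigrams(word):
    """Return the character bigrams of a word, in order, with repeats."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_coefficient(a, b):
    """Standard Dice coefficient on bigram multisets: 2*|X & Y| / (|X| + |Y|)."""
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both words are shorter than two characters: fall back to equality.
        return 1.0 if a == b else 0.0
    overlap = sum((x & y).values())  # multiset intersection size
    return 2.0 * overlap / total


def correct(token, lexicon, threshold=0.5):
    """Replace a token with its most similar lexicon entry (hypothetical helper).

    The token is returned unchanged if it is already in the lexicon, if the
    lexicon is empty, or if no entry reaches the (assumed) threshold.
    """
    if not lexicon or token in lexicon:
        return token
    best = max(lexicon, key=lambda w: dice_coefficient(token, w))
    return best if dice_coefficient(token, best) >= threshold else token


if __name__ == "__main__":
    words = ["vietnam", "hanoi"]  # toy lexicon, for illustration only
    print(correct("vietnm", words))
    print(correct("xyz", words))
```

A real normalizer for Vietnamese tweets would need a Vietnamese lexicon and handling of diacritics and abbreviations; this sketch covers only the similarity-based lookup step.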
We train our method on a training set containing more than 40,000 named entities and evaluate it on a test set containing 3,186 named entities. The experimental results show that our system achieves state-of-the-art performance, with an F1 score of 82.13%.