Text normalization for named entity recognition in Vietnamese tweets.

Author information

Nguyen Vu H, Nguyen Hien T, Snasel Vaclav

Affiliations

Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.

Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, Czech Republic.

Publication information

Comput Soc Netw. 2016;3(1):10. doi: 10.1186/s40649-016-0032-0. Epub 2016 Dec 1.

Abstract

BACKGROUND

Named entity recognition (NER) is the task of detecting named entities in documents and categorizing them into predefined classes, such as person, location, and organization. This paper focuses on tweets posted on Twitter. Because tweets are noisy, irregular, and brief, and contain acronyms and spelling errors, NER on tweets is a challenging task. Many approaches have been proposed for tweets written in English, German, Chinese, and other languages, but none for Vietnamese tweets.

METHODS

We propose a method that normalizes a tweet before passing it as input to a learning model for NER in Vietnamese tweets. The normalization step detects spelling errors in a tweet and corrects them using an improved Dice's coefficient or n-grams. A Support Vector Machine (SVM) learning algorithm is then used to train a classifier on six different types of features.
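The abstract does not spell out the improved Dice's coefficient, so the following is only a minimal sketch of the standard Sørensen-Dice coefficient over character bigrams, used to pick the closest vocabulary word for a token that is not in the vocabulary. The function names, the `threshold` value, and the toy vocabulary are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter


def char_bigrams(word):
    """Return the multiset of character bigrams in a lower-cased word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice_coefficient(a, b):
    """Standard (not the paper's improved) Dice coefficient on bigrams."""
    if a == b:
        return 1.0
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    overlap = sum((x & y).values())  # multiset intersection size
    return 2.0 * overlap / total


def correct(token, vocabulary, threshold=0.5):
    """Replace an out-of-vocabulary token with its most similar word.

    `threshold` is a hypothetical cut-off: below it the token is left
    unchanged rather than forced onto a poor match.
    """
    if token in vocabulary:
        return token
    best = max(vocabulary, key=lambda w: dice_coefficient(token, w),
               default=None)
    if best is not None and dice_coefficient(token, best) >= threshold:
        return best
    return token


# Toy vocabulary, for illustration only.
vocab = {"hanoi", "saigon"}
print(correct("hanoii", vocab))
print(correct("xyz", vocab))
```

A real normalizer for Vietnamese would also need to handle diacritics and multi-syllable words, which this sketch ignores.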

RESULTS AND CONCLUSION

We train our method on a training set containing more than 40,000 named entities and evaluate it on a test set containing 3,186 named entities. The experimental results show that our system achieves state-of-the-art performance, with an F1 score of 82.13%.
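For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it at the entity level, assuming that a predicted entity counts as correct only when its span and type both match a gold entity exactly; the abstract does not state the matching criterion, and the example entities are invented purely for illustration.

```python
def entity_f1(gold, predicted):
    """Return (precision, recall, F1) for exact-match entity sets.

    Each entity is a hashable tuple such as (start, end, type).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted entities (not data from the paper).
gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG"), (12, 13, "LOC")}
pred = {(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")}
p, r, f1 = entity_f1(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```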

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e80/5749168/fec9aebbca49/40649_2016_32_Fig1_HTML.jpg
