Text normalization for named entity recognition in Vietnamese tweets.

Authors

Nguyen Vu H, Nguyen Hien T, Snasel Vaclav

Affiliations

Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, Vietnam.

Faculty of Electrical Engineering and Computer Science, VSB-Technical University of Ostrava, Ostrava, Czech Republic.

Publication information

Comput Soc Netw. 2016;3(1):10. doi: 10.1186/s40649-016-0032-0. Epub 2016 Dec 1.

DOI:10.1186/s40649-016-0032-0
PMID:29355207
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5749168/
Abstract

BACKGROUND

Named entity recognition (NER) is the task of detecting named entities in documents and categorizing them into predefined classes, such as person, location, and organization. This paper focuses on tweets posted on Twitter. Because tweets are noisy, irregular, and brief, and contain acronyms and spelling errors, NER in tweets is a challenging task. Many approaches have been proposed for this problem in tweets written in English, German, Chinese, and other languages, but none for Vietnamese tweets.

METHODS

We propose a method that normalizes a tweet before passing it as input to a learning model for NER in Vietnamese tweets. The normalization step detects spelling errors in a tweet and corrects them using an improved Dice's coefficient or n-grams. A Support Vector Machine (SVM) learning algorithm is then used to train a classifier on six different types of features.
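
To illustrate the kind of similarity-based correction described above, here is a minimal sketch that uses the standard Dice's coefficient over character bigrams. It is not the authors' improved variant, whose details the abstract does not give. The function names, the `threshold` value, and the toy vocabulary are all assumptions made for illustration.

```python
from collections import Counter


def bigrams(word: str) -> Counter:
    """Multiset of character bigrams of a lowercased word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice(a: str, b: str) -> float:
    """Standard Dice's coefficient: 2 * |X & Y| / (|X| + |Y|) over bigram multisets."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both words are too short to have bigrams; fall back to exact match.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2 * overlap / total


def correct(token: str, vocabulary: list[str], threshold: float = 0.5) -> str:
    """Return the most similar vocabulary word, or the token itself
    if nothing reaches the (hypothetical) similarity threshold."""
    best = max(vocabulary, key=lambda w: dice(token, w), default=token)
    return best if dice(token, best) >= threshold else token


# Toy vocabulary, purely illustrative.
vocab = ["nguyen", "ngoc", "hanoi"]
print(correct("nguyeen", vocab))  # closest entry is "nguyen"
print(correct("xyz", vocab))      # no close match, token is left unchanged
```

A real system would compare against a full Vietnamese lexicon and would need to handle diacritics and Unicode normalization, which this sketch ignores.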

RESULTS AND CONCLUSION

We train our method on a training set containing more than 40,000 named entities and evaluate it on a test set containing 3,186 named entities. The experimental results show that our system achieves state-of-the-art performance, with an F1 score of 82.13%.
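
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from entity-level counts. The counts in the usage lines are hypothetical and are not taken from the paper, which reports only the final F1 value in this abstract.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * P * R / (P + R), with P = tp / (tp + fp) and R = tp / (tp + fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct entities, 2 spurious, 2 missed.
print(round(f1_score(tp=8, fp=2, fn=2), 4))
```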

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5e80/5749168/fec9aebbca49/40649_2016_32_Fig1_HTML.jpg
