A conditional random field based approach for high-accuracy part-of-speech tagging using language-independent features.

Author information

Ali Mushtaq, Khan Muzammil, Alharbi Yasser

Affiliations

Department of Computer and Software Technology, University of Swat, Swat, KP, Pakistan.

College of Computer Science and Engineering, University of Hail, Hail, Saudi Arabia.

Publication information

PeerJ Comput Sci. 2024 Dec 11;10:e2577. doi: 10.7717/peerj-cs.2577. eCollection 2024.

DOI: 10.7717/peerj-cs.2577

PMID: 39896371

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11784857/

Abstract

Part-of-speech (POS) tagging is the process of assigning a tag or label to each word of a text according to its grammatical category. It makes the grammatical structure of a text accessible and plays an important role in many natural language processing tasks, such as syntax understanding, semantic analysis, text processing, information retrieval, machine translation, and named entity recognition. Because POS tagging is sequential, context dependent, and labels every word, it is a sequence labeling task. Urdu text processing faces several challenges, including resource scarcity, morphological richness, free word order, absence of capitalization, agglutinative nature, spelling variations, and multipurpose usage of words, which raise the demand for machine learning based automatic POS tagging systems for Urdu. Therefore, a conditional random field (CRF) based supervised POS classifier has been developed for 33 different Urdu POS categories using language-independent features of Urdu text. The classifier was developed for the Urdu news dataset MM-POST, which contains 119,276 tokens from seven domains: Entertainment, Finance, General, Health, Politics, Science, and Sports. An analysis of the proposed approach is presented, showing it to be superior to other Urdu POS tagging research because it uses a simpler strategy, employing fewer word-level features as context windows together with the word length. The effective use of these features for POS tagging of Urdu text resulted in state-of-the-art performance of the CRF model, with an overall classification accuracy of 96.1%.
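
As an illustration of the feature design described in the abstract (word-level context windows together with the word length), the following is a minimal Python sketch. The window size, the feature names, the `<PAD>` marker, and the function names `token_features` and `sentence_features` are assumptions made for this example; the abstract does not give the exact feature template used in the paper.

```python
from typing import Dict, List, Union

Feature = Union[str, int]


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, Feature]:
    """Build language-independent features for the token at position i.

    Assumed feature set (hypothetical, for illustration only):
    the word itself, its length, and the neighbouring words
    inside a symmetric context window.
    """
    feats: Dict[str, Feature] = {
        "word": tokens[i],
        "length": len(tokens[i]),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the current word is already stored under "word"
        j = i + offset
        # Positions outside the sentence get a padding marker.
        feats[f"word[{offset:+d}]"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


def sentence_features(tokens: List[str], window: int = 2) -> List[Dict[str, Feature]]:
    """Return one feature dictionary per token of a sentence."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]


if __name__ == "__main__":
    # Placeholder tokens; any tokenized sentence can be used, since the
    # features do not depend on language-specific resources.
    for feats in sentence_features(["aa", "bbb", "c"], window=1):
        print(feats)
```

Per-token dictionaries of this kind are a common input format for CRF toolkits, paired with one gold POS tag per token during training. None of the features rely on a lexicon, stemmer, or capitalization, which is what makes them language independent.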

Similar articles

A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text.

BMC Med Inform Decis Mak. 2019 Apr 9;19(Suppl 2):66. doi: 10.1186/s12911-019-0770-7.

A deep learning model incorporating part of speech and self-matching attention for named entity recognition of Chinese electronic medical records.

BMC Med Inform Decis Mak. 2019 Apr 9;19(Suppl 2):65. doi: 10.1186/s12911-019-0762-7.

Improving part-of-speech tagging in Amharic language using deep neural network.

Heliyon. 2023 Jun 21;9(7):e17175. doi: 10.1016/j.heliyon.2023.e17175. eCollection 2023 Jul.

A Data-Driven Model for Automated Chinese Word Segmentation and POS Tagging.

Comput Intell Neurosci. 2022 Sep 16;2022:7622392. doi: 10.1155/2022/7622392. eCollection 2022.

Roman urdu hate speech detection using hybrid machine learning models and hyperparameter optimization.

Sci Rep. 2024 Nov 19;14(1):28590. doi: 10.1038/s41598-024-79106-7.

Comprehensive Word-Level Classification of Screening Mammography Reports Using a Neural Network Sequence Labeling Approach.

J Digit Imaging. 2019 Oct;32(5):685-692. doi: 10.1007/s10278-018-0141-4.

A deep learning approach for Named Entity Recognition in Urdu language.

PLoS One. 2024 Mar 28;19(3):e0300725. doi: 10.1371/journal.pone.0300725. eCollection 2024.

Cursive-Text: A Comprehensive Dataset for End-to-End Urdu Text Recognition in Natural Scene Images.

Data Brief. 2020 May 21;31:105749. doi: 10.1016/j.dib.2020.105749. eCollection 2020 Aug.

A dataset of Roman Urdu text with spelling variations for sentence level sentiment analysis.

Data Brief. 2024 Nov 23;57:111170. doi: 10.1016/j.dib.2024.111170. eCollection 2024 Dec.
