Ali Mushtaq, Khan Muzammil, Alharbi Yasser
Department of Computer and Software Technology, University of Swat, Swat, KP, Pakistan.
College of Computer Science and Engineering, University of Hail, Hail, Saudi Arabia.
PeerJ Comput Sci. 2024 Dec 11;10:e2577. doi: 10.7717/peerj-cs.2577. eCollection 2024.
Part-of-speech (POS) tagging is the process of assigning each word of a text a tag that identifies its grammatical category. It exposes the grammatical structure of a text and plays an important role in many natural language processing tasks, such as syntax understanding, semantic analysis, text processing, information retrieval, machine translation, and named entity recognition. Because every word must be labeled and the correct tag depends on the word's context and position in the sentence, POS tagging is a sequence labeling task. Urdu text processing faces several challenges, including resource scarcity, morphological richness, free word order, absence of capitalization, agglutinative nature, spelling variations, and multipurpose usage of words, which raise the demand for machine learning based automatic POS tagging systems for Urdu. Therefore, a supervised POS classifier based on a conditional random field (CRF) has been developed for 33 different Urdu POS categories using language-independent features of Urdu text. It is trained and evaluated on the Urdu news dataset MM-POST, which contains 119,276 tokens from seven domains: Entertainment, Finance, General, Health, Politics, Science, and Sports. An analysis of the proposed approach is presented, showing it to be superior to other Urdu POS tagging research in that it uses a simpler strategy, employing fewer word-level features (context windows together with word length). The effective use of these features for the POS tagging of Urdu text resulted in state-of-the-art performance of the CRF model, with an overall classification accuracy of 96.1%.
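The feature strategy described in the abstract (neighbouring words as a context window, plus word length) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the window size of 2, the feature names, and the `<BOS>`/`<EOS>` padding symbols are assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Build language-independent features for the token at position i.

    The window size and feature names are illustrative assumptions.
    """
    features: Dict[str, object] = {
        "bias": 1.0,
        "word": tokens[i],
        # Word length, counted in Unicode code points by len().
        "length": len(tokens[i]),
    }
    # Context window: surrounding words at offsets -window .. +window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        key = f"word[{offset:+d}]"
        if 0 <= j < len(tokens):
            features[key] = tokens[j]
        else:
            # Pad positions that fall outside the sentence.
            features[key] = "<BOS>" if j < 0 else "<EOS>"
    return features


def sentence_features(tokens: List[str], window: int = 2) -> List[Dict[str, object]]:
    """Return one feature dict per token of a sentence."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]
```

Each sentence becomes a list of feature dictionaries, one per token, which is the input shape that CRF toolkits such as sklearn-crfsuite typically accept alongside a parallel list of POS tags. No feature here relies on Urdu-specific resources such as a lexicon or morphological analyzer, which is what makes the set language-independent.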