A Data-Driven Model for Automated Chinese Word Segmentation and POS Tagging
Affiliations
Changsha University of Science and Technology, Changsha, Hunan 410000, China.
School of Electronic Communication and Electrical Engineering, Changsha University, Changsha, Hunan 410000, China.
Publication information
Comput Intell Neurosci. 2022 Sep 16;2022:7622392. doi: 10.1155/2022/7622392. eCollection 2022.
Chinese natural language processing tasks often require solving the problems of Chinese word segmentation and part-of-speech (POS) tagging. Traditional methods mainly rely on simple matching algorithms based on lexicons and rules. Such matching or statistical approaches require manual word segmentation before POS tagging, so their label prediction accuracy cannot meet practical requirements. With the continuous development of deep learning, data-driven machine learning models offer new opportunities for automating Chinese word segmentation and POS tagging. A data-driven automated Chinese word segmentation and POS tagging model is therefore proposed to address these problems. Firstly, the main idea and overall framework of the proposed model are outlined, and the tagging strategy and neural network language model it uses are described. Secondly, two main optimisations are made on the input side of the model: (1) word2vec is used to represent text features as distributed word vectors; and (2) an improved AlexNet is used to encode long-range word dependencies efficiently, with an attention mechanism added to the model. Finally, on the output side, an additional auxiliary loss function is designed to optimise for Chinese text according to word frequency. The experimental results show that, compared with other existing models, the proposed model significantly improves the accuracy and operational efficiency of Chinese word segmentation and POS tagging, verifying its effectiveness and advancement.
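The abstract mentions a tagging strategy that casts segmentation as sequence labelling but does not name it. A common choice for this family of models (an assumption here, not confirmed by the paper) is the BMES scheme, in which each character is labelled by its position within a word; a minimal sketch:

```python
# Hypothetical sketch of the BMES sequence-labelling scheme often used to
# turn Chinese word segmentation into per-character tagging. The paper does
# not name its tagging strategy; BMES is assumed for illustration.

def words_to_bmes(words):
    """Map a list of segmented words to per-character BMES tags.

    B = beginning of a multi-character word, M = middle character,
    E = end of a multi-character word, S = single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

# "自然语言处理" segmented as ["自然", "语言", "处理"]:
print(words_to_bmes(["自然", "语言", "处理"]))  # ['B', 'E', 'B', 'E', 'B', 'E']
print(words_to_bmes(["我", "爱", "数据库"]))    # ['S', 'S', 'B', 'M', 'E']
```

Under this scheme, the segmentation and POS tagging tasks can share one label sequence (e.g. `B-NN`, `E-NN`), which is how joint models typically couple the two problems.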
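The abstract's frequency-based auxiliary loss is not specified in detail. One plausible reading (an assumption, not the paper's confirmed formulation) is a cross-entropy term in which each token's contribution is weighted by the inverse of its corpus frequency, so rare words are penalised more heavily:

```python
# Hypothetical sketch of a frequency-weighted auxiliary loss. The paper only
# says the loss optimises Chinese text "based on its frequency"; the inverse-
# frequency weighting below is an assumed interpretation for illustration.

import math

def frequency_weighted_nll(gold_probs, token_freqs):
    """Weighted average negative log-likelihood over a sequence.

    gold_probs:  model probability assigned to the gold tag of each token
    token_freqs: relative corpus frequency of each token, in (0, 1]
    """
    weights = [1.0 / f for f in token_freqs]          # rarer token -> larger weight
    total = sum(w * -math.log(p) for w, p in zip(weights, gold_probs))
    return total / sum(weights)                       # normalise by total weight
```

In practice such a term would be added to the main tagging loss with a small mixing coefficient; with perfect predictions (`gold_probs` all 1.0) the term vanishes regardless of the weights.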