Chinese text classification method based on sentence information enhancement and feature fusion.

Author information

Zhu Binglin, Pan Wei

Affiliation

School of Computer Science, China West Normal University, China.

Publication information

Heliyon. 2024 Aug 24;10(17):e36861. doi: 10.1016/j.heliyon.2024.e36861. eCollection 2024 Sep 15.

Abstract

Text classification involves annotating text data with specific labels and is a crucial research task in natural language processing. Chinese text classification is particularly challenging because of the language's complex semantics, the difficulty of extracting semantic features, and the interleaved, irregular nature of lexical features. Traditional methods often struggle to model the relationships between words and sentences in Chinese, which limits a model's ability to capture deep semantic information and results in poor classification performance. To address these issues, a Chinese text classification method based on sentence information enhancement and feature fusion is proposed. The method first embeds the text into a unified space and obtains word-vector and sentence-vector feature representations using the BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model. A sentence information enhancement module then performs syntactic enhancement and feature extraction on the sentence-level information in the text. Finally, a feature fusion strategy combines the enhanced sentence-level features with the word-level features extracted by a Bi-GRU (Bidirectional Gated Recurrent Unit) network to produce the classification output. This approach strengthens the feature representation of Chinese text and filters out irrelevant and noisy information. Evaluations on several Chinese datasets show that the proposed method surpasses existing mainstream classification models in classification accuracy and F1 score, validating its effectiveness and feasibility.
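To make the described pipeline concrete, the following is a minimal PyTorch sketch of the architecture outlined in the abstract. It is not the authors' implementation: the abstract does not specify the internals of the sentence information enhancement module, so it is approximated here with a self-attention plus feed-forward block, and feature fusion is assumed to be simple concatenation; the class names, hyperparameters, and classifier head are illustrative assumptions.

```python
# Minimal sketch of the abstract's pipeline (not the authors' code).
# Assumptions: the "sentence information enhancement" module is approximated by
# self-attention + feed-forward, and feature fusion is plain concatenation.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class SentenceEnhancedClassifier(nn.Module):
    def __init__(self, num_classes: int, bert_name: str = "bert-base-chinese",
                 gru_hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size  # 768 for bert-base

        # Word-level branch: Bi-GRU over BERT token embeddings.
        self.bigru = nn.GRU(hidden, gru_hidden, batch_first=True, bidirectional=True)

        # Sentence-level branch: stand-in "enhancement" block (the paper's exact
        # design is not given in the abstract).
        self.sent_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.sent_ffn = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden))

        # Fusion + classification head (concatenation assumed).
        self.classifier = nn.Linear(2 * gru_hidden + hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        tokens = out.last_hidden_state         # (B, T, H) word vectors
        sent = out.pooler_output.unsqueeze(1)  # (B, 1, H) sentence vector

        # Word-level features: final Bi-GRU hidden states of both directions.
        _, h_n = self.bigru(tokens)                         # h_n: (2, B, gru_hidden)
        word_feat = torch.cat([h_n[0], h_n[1]], dim=-1)     # (B, 2*gru_hidden)

        # Sentence-level features: enhance the sentence vector against the tokens.
        enhanced, _ = self.sent_attn(sent, tokens, tokens,
                                     key_padding_mask=~attention_mask.bool())
        sent_feat = self.sent_ffn(enhanced.squeeze(1))      # (B, H)

        # Feature fusion and classification.
        fused = torch.cat([word_feat, sent_feat], dim=-1)
        return self.classifier(fused)


if __name__ == "__main__":
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = SentenceEnhancedClassifier(num_classes=10)
    batch = tokenizer(["这是一条中文新闻标题"], return_tensors="pt",
                      padding=True, truncation=True)
    logits = model(batch["input_ids"], batch["attention_mask"])
    print(logits.shape)  # torch.Size([1, 10])
```

Concatenation is only one plausible fusion strategy; gated or attention-based fusion would fit the same interface without changing the rest of the sketch.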

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/f962/11408784/9a0f82390667/gr1.jpg
