Agbesi Victor Kwaku, Chen Wenyu, Yussif Sophyani Banaamwini, Ukwuoma Chiagoziem C, Gu Yeong Hyeon, Al-Antari Mugahed A
School of Computer Science and Engineering, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-Tech Zone, Chengdu, Sichuan, China.
Sichuan Engineering Technology Research Center for Industrial Internet Intelligent Monitoring and Application, Chengdu University of Technology, Chengdu 610059, Sichuan, China.
Heliyon. 2024 Sep 30;10(19):e38515. doi: 10.1016/j.heliyon.2024.e38515. eCollection 2024 Oct 15.
Feature extraction plays a critical role in text classification, converting textual data into numerical representations suitable for machine learning models. A key challenge is to capture both semantic and contextual information at multiple levels of granularity without overfitting. Prior methods have often performed suboptimally, largely because of limitations in the feature extraction techniques they employ. To address these challenges, this study introduces Multi-TextCNN, a feature extractor designed to capture essential textual information across multiple levels of granularity. Multi-TextCNN is integrated into a proposed classification model, MuTCELM, which aims to enhance text classification performance. MuTCELM leverages five distinct sub-classifiers, each designed to capture different linguistic features from the text, and combines them in an ensemble framework that exploits their complementary strengths. Empirical results show that MuTCELM achieves average improvements of 0.2584, 0.2546, 0.2668, and 0.2612 in accuracy, precision, recall, and F1-macro, respectively, across all datasets, a substantial gain over the baseline models. These findings underscore the effectiveness of Multi-TextCNN in improving model performance relative to other ensemble methods. Further analysis shows that the confidence intervals of MuTCELM and the baseline models do not overlap, indicating that the observed improvements are statistically significant rather than attributable to chance. This evidence demonstrates the robustness and superiority of MuTCELM across languages and text classification tasks.
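The abstract does not disclose Multi-TextCNN's internal layout, but a TextCNN with parallel convolution branches of different kernel sizes is the standard way to extract n-gram features at several granularities. The PyTorch sketch below is a minimal illustration under that assumption; the class name MultiTextCNN, the kernel sizes, and all hyperparameters are illustrative choices, not the authors' configuration.

```python
import torch
import torch.nn as nn

class MultiTextCNN(nn.Module):
    """Illustrative multi-granularity convolutional feature extractor.

    Parallel 1-D convolutions with different kernel widths capture n-gram
    features at several granularities; their max-pooled outputs are
    concatenated before classification.
    """

    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 kernel_sizes=(2, 3, 4, 5), num_classes=4, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One convolution branch per kernel size: each branch sees a
        # different n-gram width over the embedded sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len); sequences must be at least as long
        # as the largest kernel. Embed, then move channels first for Conv1d.
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-pool each branch over time, then concatenate the granularities.
        feats = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.classifier(self.dropout(torch.cat(feats, dim=1)))
```

Concatenating the pooled branch outputs is what lets a single forward pass mix short-range (e.g. bigram) and longer-range (e.g. 5-gram) evidence, which is the multi-granularity property the abstract attributes to Multi-TextCNN.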
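The abstract likewise does not specify how the five sub-classifiers' outputs are fused. A common, minimal scheme is soft voting over class probabilities, sketched below under that assumption; soft_vote is a hypothetical helper, not the paper's combination rule.

```python
import torch

def soft_vote(logit_list):
    """Soft-voting ensemble: average each sub-classifier's class
    probabilities, then pick the highest-probability class per example.

    logit_list: list of (batch, num_classes) logit tensors, one tensor
    per sub-classifier. Returns (batch,) predicted class indices.
    """
    probs = torch.stack([torch.softmax(l, dim=1) for l in logit_list])
    return probs.mean(dim=0).argmax(dim=1)

# Illustrative usage with five sub-classifiers (hypothetical setup):
# models = [MultiTextCNN(vocab_size=30000) for _ in range(5)]
# preds = soft_vote([m(token_ids) for m in models])
```

Averaging probabilities rather than hard votes lets a confident sub-classifier outweigh several uncertain ones, which is one way an ensemble can exploit the complementary strengths the abstract describes.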