School of Computer and Communication Engineering, University of Science and Technology Beijing, Haidian, Beijing, China.
Beijing Key Laboratory of Knowledge Engineering for Materials Science, University of Science and Technology Beijing, Haidian, Beijing, China.
PLoS One. 2023 Oct 12;18(10):e0292582. doi: 10.1371/journal.pone.0292582. eCollection 2023.
Text preprocessing is an important component of Chinese text classification. At present, however, most studies on this topic explore the influence of preprocessing methods on only a few text classification algorithms, and use English text. In this paper we experimentally compared fifteen commonly used classifiers on two Chinese datasets using three widely used Chinese preprocessing methods: word segmentation, Chinese-specific stop word removal, and Chinese-specific symbol removal. We then explored the influence of the preprocessing methods on the final classifications under various conditions, namely classification evaluation, combination style, and classifier selection. Finally, we conducted a battery of additional experiments and found that most of the classifiers improved in performance after proper preprocessing was applied. Our general conclusion is that the systematic use of preprocessing methods can have a positive impact on the classification of Chinese short text, with the effect depending on the evaluation metric (such as macro-F1), the combination of preprocessing methods (such as word segmentation with Chinese-specific stop word and symbol removal), and the choice of classifier (such as machine learning and deep learning models). The best macro-F1 scores on the two datasets are 92.13% and 91.99%, which represent improvements of 0.3% and 2%, respectively, over the compared baselines.
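To make the terms in the abstract concrete, the following is a minimal Python sketch of two of the three preprocessing steps (symbol removal and stop word removal) and of the macro-F1 metric. It is an illustration under stated assumptions, not the authors' implementation: the stop word list `STOP_WORDS`, the pattern `SYMBOL_RE`, and all function names are hypothetical, and the input is assumed to be already segmented into words, since word segmentation normally relies on a dedicated segmenter that is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list for illustration only;
# real Chinese stop word lists contain hundreds of entries.
STOP_WORDS = {"的", "了", "是", "在"}

# Assumption: a "symbol" is any character that is not a CJK unified
# ideograph (U+4E00..U+9FFF), an ASCII letter, or an ASCII digit.
SYMBOL_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def remove_symbols(tokens):
    """Strip symbol characters from each token and drop empty tokens."""
    cleaned = (SYMBOL_RE.sub("", t) for t in tokens)
    return [t for t in cleaned if t]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Pre-segmented example sentence (segmentation assumed done upstream).
tokens = ["我", "的", "手机", "!", "很", "好", "。"]
processed = remove_stop_words(remove_symbols(tokens))
score = macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Because each step is a separate function, the different combinations of preprocessing methods mentioned in the abstract correspond to applying or skipping individual calls. Macro-F1 gives every class equal weight regardless of its size, which is why it is a common choice when class sizes differ.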