Endalie Demeke, Haile Getamesay, Taye Abebe Wondmagegn
Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma, Oromia, Ethiopia.
Faculty of Civil and Environmental Engineering, Jimma Institute of Technology, Jimma, Oromia, Ethiopia.
PeerJ Comput Sci. 2022 Apr 25;8:e961. doi: 10.7717/peerj-cs.961. eCollection 2022.
Text classification is the process of assigning documents to a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words and must therefore handle a large number of features. Selecting appropriate features becomes important when the initial feature set is large. In this paper, we present a hybrid feature selection method for Amharic text classification that combines document frequency (DF) with a genetic algorithm (GA). We evaluate the method on Amharic news documents obtained from the Ethiopian News Agency (ENA), covering 13 categories. Our experimental results show that the proposed method outperforms other feature selection methods used for Amharic news document classification. Combining the proposed method with the Extra Tree Classifier (ETC) improves classification accuracy: accuracy is up to 1% higher than with the hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), 2.47% higher than with GA, and 3.86% higher than with a hybrid of DF, IG, and CHI.
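The two stages named in the abstract, a DF filter followed by a GA search over feature subsets, can be illustrated with a minimal sketch. The abstract does not specify the DF threshold, the GA operators, or the fitness function, so everything below is an assumption made for illustration: the function names `df_filter` and `ga_select`, the tournament selection, single-point crossover and bit-flip mutation, the parameter values, the toy English corpus (not ENA data), and the leave-one-out nearest-neighbour score, which stands in for the accuracy of a real classifier such as ETC.

```python
import random
from collections import Counter

def df_filter(docs, min_df=2):
    """Stage 1: keep terms that occur in at least min_df documents."""
    df = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, n in df.items() if n >= min_df)

def ga_select(n_features, fitness, pop_size=20, generations=30,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Stage 2: search over 0/1 feature masks with a basic generational GA."""
    rng = random.Random(seed)

    def tournament(pop):
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(nxt) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            nxt.append(child)
        pop = nxt
        best = max(pop, key=fitness)
    return best

# Toy corpus (hypothetical, for illustration only).
docs = [
    "football match goal team",
    "team wins football cup",
    "goal scored in the match",
    "bank raises interest rate",
    "interest rate and bank loan",
    "loan market rate falls",
]
labels = ["sport", "sport", "sport", "economy", "economy", "economy"]

vocab = df_filter(docs, min_df=2)
doc_terms = [set(d.split()) for d in docs]

def fitness(mask):
    """Stand-in fitness: leave-one-out 1-NN accuracy minus a size penalty."""
    selected = {t for t, bit in zip(vocab, mask) if bit}
    if not selected:
        return 0.0
    correct = 0
    for i, terms in enumerate(doc_terms):
        # Nearest neighbour = other document sharing most selected terms.
        j = max((k for k in range(len(docs)) if k != i),
                key=lambda k: len(terms & doc_terms[k] & selected))
        correct += labels[j] == labels[i]
    return correct / len(docs) - 0.01 * len(selected)

best = ga_select(len(vocab), fitness)
print([t for t, bit in zip(vocab, best) if bit], round(fitness(best), 3))
```

The DF stage shrinks the vocabulary cheaply before the GA runs, which matters because the GA evaluates the fitness function many times per generation. In a realistic setting the fitness would be replaced by cross-validated accuracy of the chosen classifier on the masked feature matrix.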