Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma, Ethiopia.
Faculty of Computing, Bahir Dar Institute of Technology, Bahir Dar, Ethiopia.
PLoS One. 2021 May 21;16(5):e0251902. doi: 10.1371/journal.pone.0251902. eCollection 2021.
The volume of Amharic digital documents has grown rapidly in recent years, making automatic document categorization essential. In this paper, we present a novel dimension reduction approach that improves classification accuracy by combining feature selection and feature extraction. The method uses Information Gain (IG), the Chi-square test (CHI), and Document Frequency (DF) to select important features, and Principal Component Analysis (PCA) to refine the selected features. We evaluate the proposed method on a dataset containing 9 news categories. Our experimental results show that it outperforms the other methods: classification accuracy with the new dimension reduction is 92.60%, which is 13.48%, 16.51%, and 10.19% higher than with IG, CHI, and DF, respectively. Further work is required, since classification accuracy still decreases as the feature size is reduced to save computational time.
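The two-stage idea (score terms with IG, CHI, and DF, then refine the selected terms with PCA) can be illustrated with a minimal pure-Python sketch. Several parts are assumptions rather than the authors' method: the abstract does not say how the three criteria are combined, so the sketch takes the union of the top-K terms under each; the four-document corpus, the value of `K`, and all function names are hypothetical; and a simple power iteration that recovers only the first principal component stands in for a full PCA.

```python
import math

# Toy corpus of (tokens, label) pairs; invented for illustration only.
docs = [
    (["goal", "match", "team"], "sport"),
    (["team", "coach", "match"], "sport"),
    (["vote", "party", "election"], "politics"),
    (["party", "minister", "vote"], "politics"),
]
vocab = sorted({t for toks, _ in docs for t in toks})
labels = sorted({y for _, y in docs})
n = len(docs)

def doc_freq(term):
    """DF: number of documents that contain the term."""
    return sum(term in toks for toks, _ in docs)

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(term):
    """IG: class entropy minus class entropy conditioned on term presence."""
    base = entropy([sum(y == c for _, y in docs) for c in labels])
    cond = 0.0
    for present in (True, False):
        subset = [y for toks, y in docs if (term in toks) == present]
        if subset:
            cond += len(subset) / n * entropy([subset.count(c) for c in labels])
    return base - cond

def chi_square(term):
    """CHI: largest 2x2 chi-square statistic of the term over all classes."""
    best = 0.0
    for c in labels:
        a = sum(term in toks and y == c for toks, y in docs)
        b = sum(term in toks and y != c for toks, y in docs)
        e = sum(term not in toks and y == c for toks, y in docs)
        d = sum(term not in toks and y != c for toks, y in docs)
        denom = (a + e) * (b + d) * (a + b) * (e + d)
        if denom:
            best = max(best, n * (a * d - e * b) ** 2 / denom)
    return best

def top_k(score, k):
    """The k highest-scoring vocabulary terms under one criterion."""
    return sorted(vocab, key=score, reverse=True)[:k]

# Assumed combination rule: union of the top-K terms under each criterion.
K = 2
selected = sorted(
    set(top_k(info_gain, K)) | set(top_k(chi_square, K)) | set(top_k(doc_freq, K))
)

def first_component(rows, iters=100):
    """First principal component via power iteration on the centered data."""
    m = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]
    v = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero start vector
    for _ in range(iters):
        xv = [sum(row[j] * v[j] for j in range(m)) for row in x]
        w = [sum(x[i][j] * xv[i] for i in range(len(x))) for j in range(m)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm == 0:
            break
        v = [a / norm for a in w]
    projections = [sum(row[j] * v[j] for j in range(m)) for row in x]
    return v, projections

# Binary document-term matrix restricted to the selected terms.
matrix = [[float(t in toks) for t in selected] for toks, _ in docs]
component, scores = first_component(matrix)
print(selected, scores)
```

On this toy corpus the two sport documents and the two politics documents project onto opposite sides of the first component. A real pipeline would keep several components and pass the projected features to a classifier.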