

Feature selection by integrating document frequency with genetic algorithm for Amharic news document classification.

Author information

Endalie Demeke, Haile Getamesay, Taye Abebe Wondmagegn

Affiliations

Faculty of Computing and Informatics, Jimma Institute of Technology, Jimma, Oromia, Ethiopia.

Faculty of Civil and Environmental Engineering, Jimma Institute of Technology, Jimma, Oromia, Ethiopia.

Publication information

PeerJ Comput Sci. 2022 Apr 25;8:e961. doi: 10.7717/peerj-cs.961. eCollection 2022.

Abstract

Text classification is the process of categorizing documents into a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words and therefore must deal with a large number of features. Selecting appropriate features becomes important when the initial feature set is large. In this paper, we present a hybrid feature selection method for Amharic text classification that combines document frequency (DF) with a genetic algorithm (GA). We evaluate this method on Amharic news documents obtained from the Ethiopian News Agency (ENA), covering 13 categories. Our experimental results show that the proposed method outperforms other feature selection methods used for Amharic news document classification. Combined with the Extra Tree Classifier (ETC), the proposed method improves classification accuracy by up to 1% over the hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), by 2.47% over GA, and by 3.86% over a hybrid of DF, IG, and CHI.
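The general idea of a DF filter followed by a GA search over feature subsets can be illustrated with a minimal Python sketch. The abstract does not specify the GA encoding, operators, fitness function, or DF threshold, so everything below is an assumption made for illustration: the function names, the `min_df` threshold, the bit-mask encoding, tournament selection, single-point crossover, and in particular the stand-in fitness (leave-one-out nearest-neighbour accuracy on term overlap with a small size penalty) are hypothetical and are not the authors' implementation. In practice the fitness would more plausibly be the accuracy of a trained classifier such as ETC.

```python
import random
from collections import Counter


def document_frequency_filter(docs, min_df=2):
    """Keep terms that occur in at least `min_df` documents (assumed threshold)."""
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(term for term, count in df.items() if count >= min_df)


def fitness(mask, vocab, docs, labels):
    """Stand-in fitness: leave-one-out 1-NN accuracy on term overlap,
    minus a small penalty for the fraction of features kept."""
    selected = {term for term, bit in zip(vocab, mask) if bit}
    if not selected:
        return 0.0
    reduced = [set(doc) & selected for doc in docs]
    correct = 0
    for i, current in enumerate(reduced):
        # Nearest neighbour = other document sharing the most selected terms.
        nearest = max(
            (k for k in range(len(docs)) if k != i),
            key=lambda k: len(current & reduced[k]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(docs) - 0.01 * len(selected) / len(vocab)


def genetic_select(vocab, docs, labels, pop_size=20, generations=15,
                   p_mut=0.05, seed=0):
    """Search for a feature subset with a simple GA over bit masks."""
    rng = random.Random(seed)
    n = len(vocab)

    def score(mask):
        return fitness(mask, vocab, docs, labels)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=score)
    for _ in range(generations):
        next_population = [best[:]]  # elitism: carry the best mask forward
        while len(next_population) < pop_size:
            # Tournament selection of two parents.
            parent_a = max(rng.sample(population, 3), key=score)
            parent_b = max(rng.sample(population, 3), key=score)
            # Single-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n) if n > 1 else 0
            child = parent_a[:cut] + parent_b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=score)
    return [term for term, bit in zip(vocab, best) if bit]


if __name__ == "__main__":
    # Toy tokenized corpus (English placeholders, not ENA data).
    docs = [
        ["football", "match", "goal"],
        ["football", "team", "goal"],
        ["bank", "loan", "rate"],
        ["bank", "rate", "market"],
        ["match", "team", "stadium"],
        ["loan", "market", "tax"],
    ]
    labels = ["sport", "sport", "economy", "economy", "sport", "economy"]
    vocab = document_frequency_filter(docs, min_df=2)
    print(genetic_select(vocab, docs, labels, seed=1))
```

The DF step cheaply discards rare terms so that the GA searches a much smaller space; the GA then evaluates whole subsets rather than scoring terms one at a time, which is the usual motivation for pairing a filter with a wrapper-style search.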


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/8721/9137894/795e1982f640/peerj-cs-08-961-g001.jpg
