Zhang Hengyi
College of Animal Science and Technology, Northwest A&F University, Yangling, China.
Front Genet. 2021 Mar 30;12:631505. doi: 10.3389/fgene.2021.631505. eCollection 2021.
Classification is widely used in gene expression data analysis. Because gene expression data contain a large number of genes but only a small number of samples, feature selection is usually performed before classification. In this article, a novel feature selection algorithm using approximate conditional entropy based on fuzzy information granules is proposed, and the correctness of the method is proved by the monotonicity of entropy. First, the fuzzy relation matrix is established using a Laplacian kernel. Second, the approximately equal relation on fuzzy sets is defined. Third, the approximate conditional entropy based on fuzzy information granules and the importance of internal attributes are defined. Approximate conditional entropy can measure the uncertainty of knowledge from two perspectives: information theory and algebra. Finally, a greedy algorithm based on the approximate conditional entropy is designed for feature selection. Experimental results on six large-scale gene datasets show that our algorithm not only greatly reduces the dimension of the gene datasets, but also outperforms five state-of-the-art algorithms in terms of classification accuracy.
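The pipeline described above (Laplacian-kernel fuzzy relation, an entropy-based criterion, greedy selection) can be illustrated with a minimal Python sketch. The abstract does not give the formula for the approximate conditional entropy, so the sketch substitutes a standard fuzzy conditional entropy as a stand-in criterion; it is not the measure proposed in the article. The kernel form exp(-||x_i - x_j||_1 / sigma), the function names, the parameters `sigma`, `tol` and `max_features`, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def laplacian_fuzzy_relation(X, sigma=1.0):
    """Fuzzy relation R[i, j] = exp(-||x_i - x_j||_1 / sigma) (assumed kernel form)."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Pairwise L1 distances between samples over the given feature columns.
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    return np.exp(-dist / sigma)


def fuzzy_conditional_entropy(R, y):
    """Stand-in criterion (NOT the article's approximate conditional entropy).

    For each sample, compares the cardinality of its fuzzy granule (row of R)
    with the part of that granule lying in the sample's own class.
    """
    y = np.asarray(y)
    same_class = (y[:, None] == y[None, :]).astype(float)
    inter = np.minimum(R, same_class).sum(axis=1)
    card = R.sum(axis=1)
    return float(-np.mean(np.log(inter / card)))


def greedy_select(X, y, sigma=1.0, tol=1e-4, max_features=None):
    """Forward greedy selection: add the feature that lowers the criterion most."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    max_features = d if max_features is None else min(max_features, d)
    selected = []
    # With no features selected, every sample is fully related to every other.
    best = fuzzy_conditional_entropy(np.ones((n, n)), y)
    while len(selected) < max_features:
        candidates = [j for j in range(d) if j not in selected]
        scores = [
            fuzzy_conditional_entropy(
                laplacian_fuzzy_relation(X[:, selected + [j]], sigma), y
            )
            for j in candidates
        ]
        k = int(np.argmin(scores))
        if best - scores[k] < tol:
            break  # no candidate reduces the criterion enough
        selected.append(candidates[k])
        best = scores[k]
    return selected


# Toy data (hypothetical): only column 1 carries class information.
y_demo = np.array([0] * 10 + [1] * 10)
X_demo = np.column_stack([
    np.arange(20) % 2,   # uninformative
    5.0 * y_demo,        # separates the two classes
    np.arange(20) % 5,   # uninformative
])
selected_demo = greedy_select(X_demo, y_demo)
```

On real expression matrices with thousands of genes, each greedy step rebuilds one relation matrix per remaining gene, so a practical version would likely pre-filter genes or cache per-gene distances; the sketch keeps the simplest form.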