Chen Yumin, Zhang Zunjun, Zheng Jianzhong, Ma Ying, Xue Yu
College of Computer & Information Engineering, Xiamen University of Technology, Xiamen 361024, China.
Department of Urinary Surgery, The Third Xiamen Hospital of Fujian University of Traditional Chinese Medicine, Xiamen 316000, China.
J Biomed Inform. 2017 Mar;67:59-68. doi: 10.1016/j.jbi.2017.02.007. Epub 2017 Feb 13.
With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Because gene expression datasets typically contain thousands of genes but only a small number of samples, gene selection is a key step in tumor classification. Attribute reduction based on rough sets has been applied successfully to gene selection because it is data-driven and requires no additional information. However, the traditional rough set method handles discrete data only. Gene expression data that contain real-valued or noisy measurements are therefore usually discretized in a preprocessing step, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can handle real-valued data directly while preserving the classification information in the original genes. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets to handle the uncertainty and noise of gene expression data. Using this measure leads to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection method is effective in improving the accuracy of tumor classification.
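The general idea can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the exact entropy definition, so the code assumes a commonly used neighborhood conditional entropy, a Chebyshev-distance neighborhood, and a simple greedy forward search. The function names and the parameters `delta` (neighborhood radius) and `eps` (stopping threshold) are hypothetical and chosen only for illustration.

```python
import math

def neighborhood(data, i, genes, delta):
    """Indices of samples whose values lie within delta of sample i
    on every selected gene (Chebyshev distance; an assumed choice)."""
    return [j for j, row in enumerate(data)
            if all(abs(row[g] - data[i][g]) <= delta for g in genes)]

def conditional_entropy(data, labels, genes, delta):
    """Assumed measure: average of -log(fraction of each sample's
    neighbors that share its class label). Zero means every
    neighborhood is pure with respect to the class labels."""
    n = len(data)
    total = 0.0
    for i in range(n):
        nbrs = neighborhood(data, i, genes, delta)  # always contains i
        same = sum(1 for j in nbrs if labels[j] == labels[i])
        total -= math.log(same / len(nbrs))
    return total / n

def select_genes(data, labels, delta=0.2, eps=1e-6):
    """Greedy forward selection: repeatedly add the gene that lowers
    the conditional entropy most; stop when the gain is at most eps."""
    remaining = set(range(len(data[0])))
    selected = []
    best = conditional_entropy(data, labels, selected, delta)
    while remaining:
        scores = {g: conditional_entropy(data, labels, selected + [g], delta)
                  for g in sorted(remaining)}
        g = min(scores, key=scores.get)
        if best - scores[g] <= eps:
            break
        selected.append(g)
        remaining.remove(g)
        best = scores[g]
    return selected

# Toy data (made up for illustration): 4 samples, 3 genes scaled to [0, 1].
# Gene 0 separates the two classes; genes 1 and 2 carry no class information.
data = [[0.10, 0.50, 0.3],
        [0.15, 0.90, 0.3],
        [0.80, 0.55, 0.3],
        [0.90, 0.85, 0.3]]
labels = [0, 0, 1, 1]
selected = select_genes(data, labels)
```

Because neighborhoods are computed on the real-valued expression levels, no discretization step is needed. The radius `delta` controls granularity: a smaller radius produces purer neighborhoods but makes the measure more sensitive to noise, so in practice it would need tuning on normalized data.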