School of Computer Science and Technology, Anhui University, Hefei 230601, China.
Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
Int J Mol Sci. 2018 Oct 30;19(11):3398. doi: 10.3390/ijms19113398.
(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods for estimating MVs have been proposed in the past few years, yet recent studies show that the choice of imputation algorithm makes little difference to classification performance. Some scholars therefore argue that selecting informative genes for downstream classification matters more than how MVs are imputed. However, most feature-selection (FS) algorithms require the data to be imputed beforehand, and the impact of that prior imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS method is introduced for gene-expression data. To address the small sample size typical of gene-expression data, a heuristic called recursive element aggregation is proposed. The approach handles incomplete data directly, without any imputation method or missing-data assumptions. The most informative genes are selected by applying a threshold, after which a best-first search strategy is used to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms on twelve original, incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset affects subsequent FS, as measured by classification performance. By conducting FS directly on incomplete data, our method avoids the disturbances that MV imputation can introduce into subsequent FS procedures. An experiment on the small round blue cell tumor (SRBCT) dataset showed that our method identified additional genes beyond the many genes it shares with the two existing methods it was compared against.
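The idea of scoring genes directly on incomplete data can be illustrated with a minimal sketch. This is not the authors' modified statistic and it does not implement recursive element aggregation, whose details the abstract does not give. It assumes a plain Pearson chi-square test on an equal-width discretization of each gene, computed only over the samples where that gene is observed. The function names `chi_square_score` and `select_genes`, the two-bin discretization, and the threshold of 3.84 (the 0.05 critical value for one degree of freedom) are illustrative assumptions.

```python
import math


def chi_square_score(values, labels, n_bins=2):
    """Pearson chi-square statistic between one discretized gene and the
    class labels, using only samples where the gene is observed."""
    # Keep observed (non-NaN) entries only; nothing is imputed.
    observed = [(v, y) for v, y in zip(values, labels) if not math.isnan(v)]
    if len(observed) < 2:
        return 0.0
    vals = [v for v, _ in observed]
    lo, hi = min(vals), max(vals)
    if hi == lo:
        return 0.0  # a constant gene carries no class information
    width = (hi - lo) / n_bins
    classes = sorted(set(y for _, y in observed))

    # Contingency table: rows are expression bins, columns are classes.
    table = [[0] * len(classes) for _ in range(n_bins)]
    for v, y in observed:
        b = min(int((v - lo) / width), n_bins - 1)
        table[b][classes.index(y)] += 1

    n = len(observed)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_bins))
               for j in range(len(classes))]

    stat = 0.0
    for i in range(n_bins):
        for j in range(len(classes)):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat


def select_genes(matrix, labels, threshold):
    """Return (selected gene indices, all scores) for a samples-by-genes
    matrix; a gene is kept when its score exceeds the threshold."""
    n_genes = len(matrix[0])
    scores = [chi_square_score([row[g] for row in matrix], labels)
              for g in range(n_genes)]
    selected = [g for g, s in enumerate(scores) if s > threshold]
    return selected, scores


# Toy data (made up for illustration): 6 samples, 2 genes, NaN marks an MV.
nan = math.nan
toy_labels = [0, 0, 0, 1, 1, 1]
toy_matrix = [
    [1.0, 1.0],
    [1.2, 5.0],
    [nan, nan],
    [5.0, 1.0],
    [nan, 5.0],
    [5.5, nan],
]
toy_selected, toy_scores = select_genes(toy_matrix, toy_labels, threshold=3.84)
print(toy_selected, toy_scores)
```

Each gene is scored on its own set of observed samples, so a missing entry shrinks that gene's contingency table rather than forcing an imputed value into it. With very few observed samples the expected cell counts become small and the plain chi-square statistic becomes unreliable, which is the small-sample problem the abstract says recursive element aggregation is meant to address.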