IEEE/ACM Trans Comput Biol Bioinform. 2019 Nov-Dec;16(6):1802-1815. doi: 10.1109/TCBB.2018.2833482. Epub 2018 May 9.
DNA microarray datasets are characterized by a large number of features and very few samples, which is a typical cause of overfitting and poor generalization in classification tasks. Here, we introduce a novel feature selection (FS) approach which employs the distance correlation (dCor) as a criterion for evaluating the dependence of the class on a given feature subset. The dCor index provides a reliable measure of dependence between random vectors of arbitrary dimension, without any assumptions about their distribution. Moreover, it is sensitive to the presence of redundant terms. The proposed FS method is based on a probabilistic representation of the feature subset model, which is progressively refined by a repeated process of model extraction and evaluation. A key element of the approach is a distributed optimization scheme based on a vertical partitioning of the dataset, which alleviates the negative effects of its unbalanced dimensions. The proposed method has been tested on several microarray datasets, yielding compact and accurate models at a reasonable computational cost.
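To make the dCor criterion concrete, the following is a minimal sketch of the standard sample distance correlation between two sets of paired observations, each of which may be a vector of arbitrary dimension. It is not the authors' implementation: the function names (`distance_correlation`, `_centered_distances`, `_mean_product`) and the encoding of the class label as a one-element tuple are assumptions made for illustration, and the O(n^2) pure-Python loops are only practical for the small sample counts typical of microarray data.

```python
import math


def _centered_distances(rows):
    """Pairwise Euclidean distance matrix of `rows`, double-centered."""
    n = len(rows)
    d = [[math.dist(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    # The distance matrix is symmetric, so row means equal column means.
    row_mean = [sum(r) / n for r in d]
    grand_mean = sum(row_mean) / n
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def _mean_product(a, b):
    """Mean of the element-wise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation between paired observations x and y.

    x and y are sequences of equal length; each element is a tuple of
    floats (a feature-subset vector, or an encoded class label).
    """
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of samples")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = _mean_product(a, b)  # squared sample distance covariance
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0.0:
        return 0.0  # one of the inputs is constant
    return math.sqrt(max(dcov2, 0.0) / denom)


# Hypothetical toy data: a quadratic relation that a linear measure would miss.
x = [(i / 10 - 1.0,) for i in range(21)]
y = [(v[0] ** 2,) for v in x]
print(distance_correlation(x, y))
```

In a feature selection setting, `x` would hold the values of a candidate feature subset for each sample and `y` the class labels, so that a higher score suggests a stronger dependence of the class on that subset.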