College of Information Science and Engineering, Hunan University, Changsha, China.
Comput Biol Chem. 2019 Jun;80:121-127. doi: 10.1016/j.compbiolchem.2019.03.017. Epub 2019 Mar 24.
DNA microarray data have been widely used in cancer research because they help distinguish between tumor classes. However, typical gene expression data are high-dimensional and class-imbalanced, which makes it difficult for traditional machine learning methods to construct a robust classifier that performs well on both minority and majority classes. As one of the most successful feature weighting techniques, Relief is considered particularly well suited to high-dimensional problems. Unfortunately, almost all Relief-based methods fail to take the imbalanced class distribution into account. This study finds that existing Relief-based algorithms may underestimate features that discriminate minority classes and may ignore the distributional characteristics of minority class samples. As a result, an additional bias towards classification into the majority classes can be introduced. To address this, a new method, named imRelief, is proposed for efficiently handling high-dimensional imbalanced gene expression data. imRelief corrects the bias towards the majority classes and accounts for the scattered distribution of minority class samples when estimating feature weights. In this way, imRelief rewards features that separate the minority classes well from the other classes. Experiments on four microarray gene expression data sets demonstrate the effectiveness of imRelief in both feature weighting and feature subset selection.
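The abstract does not state imRelief's update rule, so the following is only a minimal sketch of the classic two-class Relief weighting that Relief-based methods build on, not the proposed method. The function name `relief_weights`, its parameters, the min-max scaling, and the Manhattan distance are illustrative assumptions, and each class is assumed to contain at least two samples.

```python
import numpy as np


def relief_weights(X, y, n_iterations=None, seed=0):
    """Classic two-class Relief feature weights (illustrative sketch only).

    This is NOT imRelief; it is the baseline scheme, written under the
    assumptions stated in the accompanying text.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape

    # Min-max scale each feature so per-feature differences are comparable.
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # constant features: avoid division by zero
    Xs = (X - lo) / span

    # Visit every sample once, or draw samples uniformly at random.
    if n_iterations is None:
        order = np.arange(n_samples)
    else:
        rng = np.random.default_rng(seed)
        order = rng.integers(0, n_samples, size=n_iterations)

    weights = np.zeros(n_features)
    for i in order:
        dist = np.abs(Xs - Xs[i]).sum(axis=1)  # Manhattan distance to sample i
        dist[i] = np.inf                       # a sample is not its own neighbour
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        # Reward features that differ on the miss, penalise those that differ on the hit.
        weights += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return weights / len(order)
```

In this baseline every visited sample contributes one equally weighted update, so a class with few samples contributes few updates. That is one way to read the underestimation of minority-discriminating features described above.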