School of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, 710049, PR China; School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, PR China.
School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, 710049, PR China.
Comput Biol Med. 2017 Oct 1;89:264-274. doi: 10.1016/j.compbiomed.2017.08.021. Epub 2017 Aug 24.
Filter feature selection techniques are widely used to mine biomedical data. Recent work has revealed a risk in the classical filter method minimal-Redundancy-Maximal-Relevance (mRMR): a specific part of the redundancy, called irrelevant redundancy, may be included in the method's minimal-redundancy component. A few attempts have been made to eliminate the irrelevant redundancy by attaching additional procedures to mRMR, such as Kernel Canonical Correlation Analysis based mRMR (KCCAmRMR). In the present study, a novel filter feature selection method based on the Maximal Information Coefficient (MIC) and Gram-Schmidt Orthogonalization (GSO), named Orthogonal MIC Feature Selection (OMICFS), was proposed to solve this problem. Unlike other improved approaches under the max-relevance and min-redundancy criterion, the proposed method uses the MIC to quantify the relevance between the feature variables and the target variable, and uses GSO to compute the orthogonalized variable of a candidate feature with respect to the previously selected features. Max-relevance and min-redundancy are then optimized indirectly by maximizing the MIC relevance between the GSO-orthogonalized variable and the target. This orthogonalization strategy allows OMICFS to exclude the irrelevant redundancy without any additional procedures. To verify its performance, OMICFS was compared with other filter feature selection methods in terms of both classification accuracy and computational efficiency in classification experiments on two types of biomedical datasets. The results showed that OMICFS outperforms the other methods in most cases. In addition, the differences between these methods were analyzed, and the application of OMICFS to the mining of high-dimensional biomedical data was discussed. The Matlab code for the proposed method is available at https://github.com/lhqxinghun/bioinformatics/tree/master/OMICFS/.
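The selection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Matlab implementation. The function names (`orthogonalize`, `greedy_orthogonal_select`, `abs_pearson`) and the toy data are hypothetical. The relevance score used here is the absolute Pearson correlation, a simple stand-in for MIC, which the standard library does not provide; a real MIC estimator could presumably be passed in through the `relevance` argument.

```python
import math

def center(v):
    """Subtract the mean so that projections act on zero-mean vectors."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def abs_pearson(a, b):
    """Stand-in relevance score (NOT MIC): absolute Pearson correlation."""
    a, b = center(a), center(b)
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return abs(dot(a, b)) / denom if denom > 1e-12 else 0.0

def orthogonalize(v, basis):
    """Gram-Schmidt step: remove from v its projection onto each basis vector.

    The basis vectors are assumed to be mutually orthogonal, which holds here
    because each one is itself the residual of an earlier Gram-Schmidt step.
    """
    r = list(v)
    for q in basis:
        qq = dot(q, q)
        if qq > 1e-12:
            coef = dot(r, q) / qq
            r = [x - coef * y for x, y in zip(r, q)]
    return r

def greedy_orthogonal_select(features, target, k, relevance=abs_pearson):
    """Greedily pick up to k feature names.

    At each step, every remaining candidate is orthogonalized against the
    already selected features, and the candidate whose orthogonalized
    residual is most relevant to the target is chosen.
    """
    centered = {name: center(col) for name, col in features.items()}
    selected, basis = [], []
    remaining = set(centered)
    while remaining and len(selected) < k:
        best_name, best_score, best_resid = None, -1.0, None
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            resid = orthogonalize(centered[name], basis)
            # A (near) zero residual means the candidate is fully redundant.
            score = relevance(resid, target) if dot(resid, resid) > 1e-12 else 0.0
            if score > best_score:
                best_name, best_score, best_resid = name, score, resid
        selected.append(best_name)
        basis.append(best_resid)
        remaining.remove(best_name)
    return selected

# Toy data (hypothetical): f1_dup is an exact multiple of f1, so it is fully redundant.
features = {
    "f1": [1, 2, 3, 4, 5, 6],
    "f1_dup": [2, 4, 6, 8, 10, 12],
    "f2": [1, -1, 1, -1, 1, -1],
}
target = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5]
selected = greedy_orthogonal_select(features, target, k=2)
print(selected)
```

In this toy example the duplicated feature has a zero residual once its twin has been selected, so it scores zero and is passed over in favor of `f2`. This mirrors, in a simplified linear setting, how orthogonalization handles redundancy without a separate redundancy term.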