McTwo: a two-step feature selection algorithm based on maximal information coefficient.

BACKGROUND: High-throughput bio-OMIC technologies are producing high-dimensional data from bio-samples at an ever-increasing rate, whereas the number of training samples in a traditional experiment remains small due to various difficulties. This "large p, small n" paradigm in biomedical "big data" may be at least partly addressed by feature selection algorithms, which select only the features significantly associated with phenotypes. Feature selection is an NP-hard problem. Because the time needed to find the globally optimal solution grows exponentially, all existing feature selection algorithms employ heuristic rules to find locally optimal solutions, and their solutions perform differently on different datasets.

RESULTS: This work describes a feature selection algorithm based on a recently published correlation measurement, the Maximal Information Coefficient (MIC). The proposed algorithm, McTwo, aims to select features that are associated with phenotypes, that are independent of each other, and that achieve high classification performance with the nearest neighbor algorithm. Based on a comparative study of 17 datasets, McTwo performs about as well as or better than existing algorithms, with significantly fewer selected features. According to the literature, the features selected by McTwo also appear to have particular biomedical relevance to the phenotypes.

CONCLUSION: McTwo selects a small feature subset with very good classification performance. McTwo may therefore serve as a complementary feature selection algorithm for high-dimensional biomedical datasets.
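The abstract states three goals for the selected features (association with the phenotype, mutual independence, and good nearest-neighbor classification) but does not spell out the two steps. The sketch below is one plausible filter-then-wrapper reading of those goals, not the authors' implementation. Everything in it is an assumption made for illustration: `abs_pearson` is a simple stand-in for MIC (a real MIC estimator could be passed in its place), and the threshold, the redundancy rule, the greedy forward search, the leave-one-out 1-NN accuracy, and the toy data are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Column = Sequence[float]
Score = Callable[[Column, Column], float]


def abs_pearson(x: Column, y: Column) -> float:
    """Stand-in association score in [0, 1]. This is NOT MIC (assumption)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def filter_step(columns: Sequence[Column], labels: Column,
                score: Score, threshold: float) -> List[int]:
    """Assumed step 1: keep label-associated features, drop redundant ones."""
    relevance = [score(col, labels) for col in columns]
    order = sorted((j for j, r in enumerate(relevance) if r >= threshold),
                   key=lambda j: -relevance[j])
    kept: List[int] = []
    for j in order:
        # Keep j only if no stronger, already-kept feature is associated
        # with j at least as strongly as j is associated with the label.
        if all(score(columns[j], columns[k]) < relevance[j] for k in kept):
            kept.append(j)
    return kept


def loo_1nn_accuracy(columns: Sequence[Column], labels: Column,
                     subset: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    n = len(labels)
    correct = 0
    for i in range(n):
        best_dist, best_label = math.inf, None
        for m in range(n):
            if m == i:
                continue
            dist = sum((columns[j][i] - columns[j][m]) ** 2 for j in subset)
            if dist < best_dist:
                best_dist, best_label = dist, labels[m]
        correct += best_label == labels[i]
    return correct / n


def wrapper_step(columns: Sequence[Column], labels: Column,
                 candidates: Sequence[int]) -> Tuple[List[int], float]:
    """Assumed step 2: greedy forward search; stop when accuracy stalls."""
    selected: List[int] = []
    best = 0.0
    remaining = list(candidates)
    while remaining:
        scored = [(loo_1nn_accuracy(columns, labels, selected + [j]), j)
                  for j in remaining]
        acc, j = max(scored, key=lambda t: t[0])
        if acc <= best:
            break
        selected.append(j)
        remaining.remove(j)
        best = acc
    return selected, best


# Toy data (hypothetical): 20 samples, two classes, four features.
labels = [0.0] * 10 + [1.0] * 10
wobble = [0.0, 0.9, -0.9, 0.6, -0.6]
f0 = [y + 0.01 * (i % 5) for i, y in enumerate(labels)]  # strongly informative
f1 = [2 * v + 1 for v in f0]                             # rescaled copy of f0
f2 = [(-1.0) ** i for i in range(20)]                    # unrelated to the label
f3 = [y + wobble[i % 5] for i, y in enumerate(labels)]   # noisier but informative
columns = [f0, f1, f2, f3]

kept = filter_step(columns, labels, abs_pearson, threshold=0.5)
selected, accuracy = wrapper_step(columns, labels, kept)
print(kept, selected, accuracy)
```

On this toy data the filter discards the unrelated feature and one of the two rescaled copies, and the wrapper then stops as soon as adding a feature no longer improves accuracy, which mirrors the small feature subsets the abstract reports. Because `score` is a parameter, replacing the stand-in with an actual MIC estimator would not require changing the rest of the sketch.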
