Ge Ruiquan, Zhou Manli, Luo Youxi, Meng Qinghan, Mai Guoqin, Ma Dongli, Wang Guoqing, Zhou Fengfeng
Shenzhen Institutes of Advanced Technology, and Key Lab for Health Informatics, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, Guangdong, 518055, P.R. China.
Shenzhen College of Advanced Technology, University of Chinese Academy of Sciences, Shenzhen, Guangdong, 518055, P.R. China.
BMC Bioinformatics. 2016 Mar 23;17:142. doi: 10.1186/s12859-016-0990-0.
BACKGROUND: High-throughput bio-OMIC technologies are producing high-dimensional data from bio-samples at an ever-increasing rate, whereas the number of training samples in a traditional experiment remains small due to various difficulties. This "large p, small n" paradigm in biomedical "big data" may be at least partly addressed by feature selection algorithms, which select only the features significantly associated with phenotypes. Feature selection is an NP-hard problem. Because the time needed to find the globally optimal solution grows exponentially, all existing feature selection algorithms employ heuristic rules to find locally optimal solutions, and their solutions perform differently on different datasets.

RESULTS: This work describes a feature selection algorithm based on a recently published correlation measure, the Maximal Information Coefficient (MIC). The proposed algorithm, McTwo, aims to select features that are associated with phenotypes, are independent of each other, and achieve high classification performance with the nearest neighbor algorithm. In a comparative study on 17 datasets, McTwo performs about as well as or better than existing algorithms, with significantly fewer selected features. According to the literature, the features selected by McTwo also appear to have particular biomedical relevance to the phenotypes.

CONCLUSION: McTwo selects a feature subset with very good classification performance and a small number of features. McTwo may therefore serve as a complementary feature selection algorithm for high-dimensional biomedical datasets.
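The abstract names three goals for the selected features: association with the phenotype, mutual independence, and good nearest neighbor classification. It does not spell out the steps of McTwo, so the following is only a minimal sketch of a generic greedy filter in that spirit, not the published procedure. Absolute Pearson correlation is swapped in for MIC to keep the example dependency-free, and the function names, the thresholds `relevance_min` and `redundancy_max`, and the synthetic data are all hypothetical.

```python
import math
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation. ASSUMPTION: a stand-in for MIC."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def column(X, j):
    """Return feature j as a list of values, one per sample."""
    return [row[j] for row in X]


def select_features(X, y, relevance_min=0.3, redundancy_max=0.8,
                    score=abs_pearson):
    """Greedy filter (hypothetical thresholds, not McTwo itself).

    Visits features in decreasing order of association with the labels
    and keeps one only if it is not strongly associated with any
    feature that was already kept.
    """
    ranked = sorted(
        ((score(column(X, j), y), j) for j in range(len(X[0]))),
        reverse=True,
    )
    selected = []
    for relevance, j in ranked:
        if relevance < relevance_min:
            break  # remaining features are even less relevant
        if all(score(column(X, j), column(X, k)) <= redundancy_max
               for k in selected):
            selected.append(j)
    return selected


def loo_1nn_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for k in range(len(X)):
            if k == i:
                continue
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


# Synthetic data (made up for illustration): feature 0 tracks the label,
# feature 1 is a near copy of feature 0, features 2-5 are pure noise.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = []
for label in y:
    f0 = 2.0 * label + rng.gauss(0, 0.3)
    f1 = f0 + rng.gauss(0, 0.05)
    X.append([f0, f1] + [rng.gauss(0, 1) for _ in range(4)])

selected = select_features(X, y, relevance_min=0.5)
print("selected features:", selected)
print("LOO 1-NN accuracy:", loo_1nn_accuracy(X, y, selected))
```

Because features 0 and 1 are nearly identical, the redundancy check keeps only one of them. Passing a MIC implementation as the `score` argument would bring the sketch closer to the measure named in the abstract, although the thresholds would then need to be chosen again.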