Evaluating feature selection strategies for high dimensional, small sample size datasets.

Authors

Golugula Abhishek, Lee George, Madabhushi Anant

Affiliation

Department of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey 08854, USA.

Publication details

Annu Int Conf IEEE Eng Med Biol Soc. 2011;2011:949-52. doi: 10.1109/IEMBS.2011.6090214.

DOI:10.1109/IEMBS.2011.6090214
PMID:22254468
Abstract

In this work, we analyze and evaluate different strategies for comparing Feature Selection (FS) schemes on High Dimensional (HD) biomedical datasets (e.g., gene and protein expression studies) with a small sample size (SSS). Additionally, we define a new measure, Robustness, specifically for comparing the ability of an FS scheme to remain invariant to changes in its training data. While classifier accuracy has been the de facto method for evaluating FS schemes, the curse of dimensionality means it might not always be the appropriate measure for HD/SSS datasets. A small sample size gives the dataset a higher probability of containing data that is not representative of the true distribution of the whole population. An ideal FS scheme must therefore be robust enough to produce the same results each time the training data changes. In this study, we employed the robustness measure in conjunction with classifier accuracy (measured via the K-Nearest Neighbor and Random Forest classifiers) to quantitatively compare five FS schemes (T-test, F-test, Kolmogorov-Smirnov Test, Wilks' Lambda Test, and Wilcoxon Rank Sum Test) on five HD/SSS gene and protein expression datasets corresponding to ovarian cancer, lung cancer, bone lesions, celiac disease, and coronary heart disease. Of the five FS schemes compared, the Wilcoxon Rank Sum Test was found to outperform the others in terms of classification accuracy and robustness. Our results suggest that both classifier accuracy and robustness should be considered when deciding on the appropriate FS scheme for HD/SSS datasets.
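
The abstract does not say how robustness is computed, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes robustness can be approximated as the mean pairwise Jaccard overlap of the top-k feature sets chosen on repeated random subsets of the training data, and it ranks features with a hand-written normal-approximation Wilcoxon rank-sum statistic (no tie correction). Every name here (`average_ranks`, `rank_sum_z`, `select_top_k`, `robustness`, `make_toy_data`), the subsampling fraction, and the synthetic data are hypothetical choices made for illustration.

```python
import math
import random
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_z(a, b):
    """z statistic of the Wilcoxon rank-sum test (normal approximation)."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    w = sum(ranks[:n1])  # rank sum of the first group
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (w - mean) / sd


def select_top_k(X, y, k):
    """Score each feature by |z| between classes 0 and 1; keep the k best."""
    n_features = len(X[0])
    scores = []
    for f in range(n_features):
        a = [row[f] for row, label in zip(X, y) if label == 0]
        b = [row[f] for row, label in zip(X, y) if label == 1]
        scores.append(abs(rank_sum_z(a, b)))
    return set(sorted(range(n_features), key=lambda f: -scores[f])[:k])


def robustness(X, y, k, n_resamples=20, frac=0.8, seed=0):
    """ASSUMED definition: mean pairwise Jaccard overlap of top-k sets
    selected on stratified random subsets of the training data."""
    rng = random.Random(seed)
    idx0 = [i for i, label in enumerate(y) if label == 0]
    idx1 = [i for i, label in enumerate(y) if label == 1]
    selections = []
    for _ in range(n_resamples):
        # Subsample each class separately so both are always present.
        keep = (rng.sample(idx0, max(2, int(frac * len(idx0))))
                + rng.sample(idx1, max(2, int(frac * len(idx1)))))
        selections.append(
            select_top_k([X[i] for i in keep], [y[i] for i in keep], k))
    overlaps = [len(s & t) / len(s | t) for s, t in combinations(selections, 2)]
    return sum(overlaps) / len(overlaps)


def make_toy_data(n_per_class=15, n_features=200, n_informative=5,
                  shift=4.0, seed=1):
    """Synthetic HD/SSS data: only the first n_informative features differ."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0, 1) for _ in range(n_features)]
            if label == 1:
                for f in range(n_informative):
                    row[f] += shift
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("top-5 features:", sorted(select_top_k(X, y, 5)))
    print("robustness:", round(robustness(X, y, 5), 3))
```

A score near 1 means the scheme picks almost the same features no matter which samples are held out, while a score near 0 means the selection is driven by the particular training subset. In the spirit of the abstract, such a score would be reported alongside classifier accuracy, not instead of it.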
