Effect of finite sample size on feature selection and classification: a simulation study.

Affiliation

Department of Radiology, University of Michigan, Ann Arbor, Michigan 48109-5842, USA.

Publication information

Med Phys. 2010 Feb;37(2):907-20. doi: 10.1118/1.3284974.

DOI:10.1118/1.3284974

PMID:20229900

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2826389/

Abstract

PURPOSE

The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer-aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to that trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performances of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and the training sample size. The understanding of these relationships will facilitate development of effective CAD systems under the constraint of limited available samples.

METHODS

Three feature selection techniques, stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and support vector machine (SVM), were investigated. Samples were drawn from multidimensional feature spaces of multivariate Gaussian distributions with equal or unequal covariance matrices and unequal means, and with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic curve, Az. The mean Az values obtained by the resubstitution and hold-out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was 50, 100, or 200.
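
The resubstitution versus hold-out comparison described above can be illustrated with a minimal sketch. It is not the authors' code and rests on several assumptions: the classifier is a simplified mean-difference linear discriminant (the LDA direction only when the covariance is taken to be the identity), no feature selection step is applied, the area under the ROC curve is estimated with the nonparametric Mann-Whitney statistic (the abstract does not say how Az was estimated), and the number of informative features, the mean shift, the test set size, and the number of trials are hypothetical values chosen for illustration.

```python
import random
from statistics import mean


def auc(neg_scores, pos_scores):
    """Nonparametric AUC (Mann-Whitney): P(pos > neg) + 0.5 * P(pos == neg)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def draw(rng, n, means):
    """Draw n samples from a Gaussian with the given means and identity covariance."""
    return [[rng.gauss(m, 1.0) for m in means] for _ in range(n)]


def train_mean_difference(class0, class1):
    """Return w = mean(class1) - mean(class0), a simplified linear discriminant."""
    dim = len(class0[0])
    m0 = [mean(x[j] for x in class0) for j in range(dim)]
    m1 = [mean(x[j] for x in class1) for j in range(dim)]
    return [b - a for a, b in zip(m0, m1)]


def score(w, samples):
    """Project each sample onto the weight vector w."""
    return [sum(wj * xj for wj, xj in zip(w, x)) for x in samples]


def simulate(n_train, n_features=50, n_informative=5, shift=0.8,
             n_test=200, n_trials=20, seed=0):
    """Mean resubstitution and hold-out AUC over repeated training draws.

    All default values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    mu0 = [0.0] * n_features
    mu1 = [shift] * n_informative + [0.0] * (n_features - n_informative)
    resub, holdout = [], []
    for _ in range(n_trials):
        tr0, tr1 = draw(rng, n_train, mu0), draw(rng, n_train, mu1)
        te0, te1 = draw(rng, n_test, mu0), draw(rng, n_test, mu1)
        w = train_mean_difference(tr0, tr1)
        # Resubstitution: score the same samples used for training.
        resub.append(auc(score(w, tr0), score(w, tr1)))
        # Hold-out: score independent samples from the same distributions.
        holdout.append(auc(score(w, te0), score(w, te1)))
    return mean(resub), mean(holdout)


if __name__ == "__main__":
    for n in (15, 50, 100):
        r, h = simulate(n)
        print(f"n_train={n:3d}  resubstitution AUC={r:.3f}  hold-out AUC={h:.3f}")
```

Under these assumptions, the resubstitution estimate should sit above the hold-out estimate at small training sizes, and the gap should narrow as the number of training samples per class grows.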

RESULTS

It was found that the relative performance of the different combinations of classifier and feature selection method depends on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold-out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small sample sizes, but inferior for SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than the LDA when few training samples were available, while LDA performed better when a large number of training samples were available.

CONCLUSIONS

None of the investigated feature selection-classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.
