Effect of finite sample size on feature selection and classification: a simulation study.

Affiliation

Department of Radiology, University of Michigan, Ann Arbor, Michigan 48109-5842, USA.

Publication information

Med Phys. 2010 Feb;37(2):907-20. doi: 10.1118/1.3284974.

PMID: 20229900
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC2826389/
Abstract

PURPOSE: The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer-aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to that trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performances of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and the training sample size. The understanding of these relationships will facilitate development of effective CAD systems under the constraint of limited available samples.

METHODS: Three feature selection techniques, the stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and support vector machine (SVM), were investigated. Samples were drawn from multidimensional feature spaces of multivariate Gaussian distributions with equal or unequal covariance matrices and unequal means, and with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic curve Az. The mean Az values obtained by resubstitution and hold-out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was chosen to be 50, 100, and 200.

RESULTS: It was found that the relative performance of the different combinations of classifier and feature selection method depends on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold-out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small sample sizes, but inferior for SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than the LDA when few training samples were available, while LDA performed better when a large number of training samples were available.

CONCLUSIONS: None of the investigated feature selection-classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.
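
The contrast between resubstitution and hold-out estimates described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or experimental setup: it assumes two Gaussian classes with identity covariance, an arbitrary mean shift of 0.3 per feature, 10 features, no feature selection step, and an empirical (Mann-Whitney) area under the ROC curve standing in for Az. All function names and parameter values are hypothetical choices made for illustration, and only the Python standard library is used.

```python
import random

def empirical_auc(neg_scores, pos_scores):
    """Mann-Whitney estimate of the area under the ROC curve (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def draw(rng, n, mean):
    """Draw n samples from a Gaussian with the given mean and identity covariance."""
    return [[rng.gauss(m, 1.0) for m in mean] for _ in range(n)]

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def pooled_covariance(class0, class1, mean0, mean1):
    """Pooled within-class covariance matrix of the two training classes."""
    d = len(mean0)
    cov = [[0.0] * d for _ in range(d)]
    for rows, mean in ((class0, mean0), (class1, mean1)):
        for row in rows:
            diff = [x - m for x, m in zip(row, mean)]
            for i in range(d):
                for j in range(d):
                    cov[i][j] += diff[i] * diff[j]
    dof = len(class0) + len(class1) - 2
    return [[v / dof for v in r] for r in cov]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_lda(class0, class1):
    """Fisher LDA weights: pooled_cov^{-1} (mean1 - mean0)."""
    mean0, mean1 = column_means(class0), column_means(class1)
    cov = pooled_covariance(class0, class1, mean0, mean1)
    return solve(cov, [b - a for a, b in zip(mean0, mean1)])

def scores(weights, rows):
    return [sum(w * x for w, x in zip(weights, row)) for row in rows]

def simulate(n_train, n_features=10, shift=0.3, n_test=300, trials=20, seed=0):
    """Return (mean resubstitution AUC, mean hold-out AUC) over repeated draws."""
    rng = random.Random(seed)
    mean0 = [0.0] * n_features
    mean1 = [shift] * n_features
    resub, holdout = 0.0, 0.0
    for _ in range(trials):
        train0, train1 = draw(rng, n_train, mean0), draw(rng, n_train, mean1)
        test0, test1 = draw(rng, n_test, mean0), draw(rng, n_test, mean1)
        w = fit_lda(train0, train1)
        # Resubstitution: score the same samples used for training.
        resub += empirical_auc(scores(w, train0), scores(w, train1))
        # Hold-out: score independent samples from the same distributions.
        holdout += empirical_auc(scores(w, test0), scores(w, test1))
    return resub / trials, holdout / trials

if __name__ == "__main__":
    for n in (15, 100):
        r, h = simulate(n)
        print(f"n_train={n}: resubstitution AUC={r:.3f}, hold-out AUC={h:.3f}")
```

Under these assumptions, the resubstitution estimate should sit above the hold-out estimate at 15 training samples per class, and the gap would be expected to narrow at 100 per class. Extending the sketch toward the study's design would mean adding a feature selection step (SFS, SFFS, or PCA) inside the training loop and swapping in an SVM, neither of which is attempted here.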

Similar articles

[1]
Effect of finite sample size on feature selection and classification: a simulation study.

Med Phys. 2010-2

[2]
Classifier design for computer-aided diagnosis: effects of finite sample size on the mean performance of classical and neural network classifiers.

Med Phys. 1999-12

[3]
Feature selection and classifier performance in computer-aided diagnosis: the effect of finite sample size.

Med Phys. 2000-7

[4]
Classifier performance prediction for computer-aided diagnosis using a limited dataset.

Med Phys. 2008-4

[5]
Feature extraction and pattern classification of colorectal polyps in colonoscopic imaging.

Comput Med Imaging Graph. 2014-6

[6]
Computer-aided detection of lung nodules: false positive reduction using a 3D gradient field method and 3D ellipsoid fitting.

Med Phys. 2005-8

[7]
Computer aided detection of clusters of microcalcifications on full field digital mammograms.

Med Phys. 2006-8

[8]
Computer aided characterization of the solitary pulmonary nodule using volumetric and contrast enhancement features.

Acad Radiol. 2005-10

[9]
Computerized analysis of mammographic microcalcifications in morphological and texture feature spaces.

Med Phys. 1998-10

[10]
Computer-aided diagnosis of pulmonary nodules on CT scans: improvement of classification performance with nodule surface features.

Med Phys. 2009-7

Cited by

[1]
Integrating Rapid Evaporative Ionization Mass Spectrometry Classification with Matrix-Assisted Laser Desorption Ionization Mass Spectrometry Imaging and Liquid Chromatography-Tandem Mass Spectrometry to Unveil Glioblastoma Overall Survival Prediction.

ACS Chem Neurosci. 2025-3-19

[2]
Key risk factors of generalized anxiety disorder in adolescents: machine learning study.

Front Public Health. 2025-1-7

[3]
Artificial intelligence-based motion tracking in cancer radiotherapy: A review.

J Appl Clin Med Phys. 2024-11

[4]
Prediction of the Ki-67 expression level in head and neck squamous cell carcinoma with machine learning-based multiparametric MRI radiomics: a multicenter study.

BMC Cancer. 2024-4-5

[5]
Editorial: Computational modelling of cardiovascular hemodynamics and machine learning.

Front Cardiovasc Med. 2024-2-22

[6]
Survival Prediction of Patients with Bladder Cancer after Cystectomy Based on Clinical, Radiomics, and Deep-Learning Descriptors.

Cancers (Basel). 2023-9-1

[7]
Machine learning for detecting Wilson's disease by amplitude of low-frequency fluctuation.

Heliyon. 2023-7-7

[8]
EEG-Driven Prediction Model of Oxcarbazepine Treatment Outcomes in Patients With Newly-Diagnosed Focal Epilepsy.

Front Med (Lausanne). 2022-1-3

[9]
Machine Learning-Based Radiomics in Neuro-Oncology.

Acta Neurochir Suppl. 2022

[10]
Neural Tracking of Sound Rhythms Correlates With Diagnosis, Severity, and Prognosis of Disorders of Consciousness.

Front Neurosci. 2021-4-28

References cited in this article

[1]
Classifier performance prediction for computer-aided diagnosis using a limited dataset.

Med Phys. 2008-4

[2]
Support vector machines for histogram-based image classification.

IEEE Trans Neural Netw. 1999

[3]
Classifier performance estimation under the constraint of a finite sample size: resampling schemes applied to neural network classifiers.

Neural Netw. 2008

[4]
Computer-aided detection of interstitial abnormalities in chest radiographs using a reference standard based on computed tomography.

Med Phys. 2007-12

[5]
Comparison of typical evaluation methods for computer-aided diagnostic schemes: Monte Carlo simulation study.

Med Phys. 2007-3

[6]
A fully automated method for lung nodule detection from postero-anterior chest radiographs.

IEEE Trans Med Imaging. 2006-12

[7]
Computer-aided diagnosis of pulmonary nodules on CT scans: segmentation and classification using 3D active contours.

Med Phys. 2006-7

[8]
What should be expected from feature selection in small-sample settings.

Bioinformatics. 2006-10-1

[9]
Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data.

BMC Bioinformatics. 2006-4-10

[10]
Analysis and minimization of overtraining effect in rule-based classifiers for computer-aided diagnosis.

Med Phys. 2006-2
