Department of Radiology, University of Michigan, Ann Arbor, Michigan 48109-5842, USA.
Med Phys. 2010 Feb;37(2):907-20. doi: 10.1118/1.3284974.
PURPOSE: The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer-aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to one trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performance of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and training sample size. Understanding these relationships will facilitate the development of effective CAD systems under the constraint of limited available samples.
METHODS: Three feature selection techniques, stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and support vector machine (SVM), were investigated. Samples were drawn from multidimensional feature spaces of multivariate Gaussian distributions with equal or unequal covariance matrices and unequal means, and with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic curve, Az. The mean Az values obtained by the resubstitution and hold-out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was chosen to be 50, 100, and 200.
RESULTS: The relative performance of the different combinations of classifier and feature selection method was found to depend on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and the SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold-out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small sample sizes, but inferior for SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than the LDA when few training samples were available, while LDA performed better when a large number of training samples were available.
CONCLUSIONS: None of the investigated feature selection-classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method, while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.
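The resubstitution and hold-out estimates of Az described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes independent unit-variance Gaussian features (a special case of equal covariance matrices), replaces Fisher's LDA with a simpler diagonal pooled-variance linear discriminant, and omits feature selection entirely. All function names and the default values (5 informative features, a mean shift of 0.5, 200 test samples per class, 20 trials) are hypothetical choices made for illustration.

```python
import random
import statistics


def auc(pos_scores, neg_scores):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 1/2)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def draw(rng, n, mean):
    """Draw n samples whose features are independent unit-variance Gaussians."""
    return [[rng.gauss(m, 1.0) for m in mean] for _ in range(n)]


def fit_diagonal_discriminant(pos, neg):
    """Linear weights w_j = (mean difference) / (pooled variance) per feature.

    Assumption: this ignores feature correlations, unlike full Fisher LDA.
    """
    weights = []
    for j in range(len(pos[0])):
        p = [x[j] for x in pos]
        q = [x[j] for x in neg]
        pooled = ((len(p) - 1) * statistics.variance(p)
                  + (len(q) - 1) * statistics.variance(q)) / (len(p) + len(q) - 2)
        weights.append((statistics.fmean(p) - statistics.fmean(q)) / pooled)
    return weights


def score(weights, samples):
    """Project each sample onto the discriminant direction."""
    return [sum(w * v for w, v in zip(weights, x)) for x in samples]


def simulate(n_train, dim=50, n_informative=5, shift=0.5,
             n_test=200, n_trials=20, seed=0):
    """Return (mean resubstitution Az, mean hold-out Az) over n_trials."""
    rng = random.Random(seed)
    mean_neg = [0.0] * dim
    mean_pos = [shift] * n_informative + [0.0] * (dim - n_informative)
    resub, holdout = [], []
    for _ in range(n_trials):
        train_pos = draw(rng, n_train, mean_pos)
        train_neg = draw(rng, n_train, mean_neg)
        test_pos = draw(rng, n_test, mean_pos)
        test_neg = draw(rng, n_test, mean_neg)
        w = fit_diagonal_discriminant(train_pos, train_neg)
        # Resubstitution: score the same samples used for training.
        resub.append(auc(score(w, train_pos), score(w, train_neg)))
        # Hold-out: score independent samples from the same distributions.
        holdout.append(auc(score(w, test_pos), score(w, test_neg)))
    return statistics.fmean(resub), statistics.fmean(holdout)


if __name__ == "__main__":
    for n in (15, 100):
        r, h = simulate(n)
        print(f"n_train={n:3d} per class: resubstitution Az={r:.3f}, hold-out Az={h:.3f}")
```

Under these assumptions, the resubstitution estimate should be optimistically biased and the hold-out estimate pessimistically biased relative to the infinite-sample performance, with the gap between them narrowing as the training sample size grows from 15 to 100 per class.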