Dougherty E R
Department of Electrical Engineering, Texas A&M University, College Station, TX 77843-3128, USA.
Comp Funct Genomics. 2001;2(1):28-34. doi: 10.1002/cfg.62.
In order to study the molecular biological differences between normal and diseased tissues, it is desirable to perform classification among diseases and stages of disease using microarray-based gene-expression values. Owing to the limited number of microarrays typically used in these studies, serious issues arise with respect to the design, performance and analysis of classifiers based on microarray data. This paper reviews some fundamental issues facing small-sample classification: classification rules, constrained classifiers, error estimation and feature selection. It discusses both unconstrained and constrained classifier design from sample data, and the contributions to classifier error from constrained optimization and lack of optimality owing to design from sample data. The difficulty with estimating classifier error when confined to small samples is addressed, particularly estimating the error from training data. The impact of small samples on the ability to include more than a few variables as classifier features is explained.
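The optimism of a training-data error estimate on small samples can be illustrated with a short simulation. The sketch below is not taken from the paper; it assumes a nearest-centroid classification rule, 20 samples, 50 features, and labels that are independent of the features, so the true error of any classifier is 0.5. All function names and parameters are hypothetical choices made for this illustration. It compares the resubstitution estimate (error counted on the training data itself) with a leave-one-out estimate.

```python
import random


def make_null_sample(n_per_class, n_features, rng):
    """Draw a balanced two-class sample whose features are pure noise.

    Assumption for this sketch: labels carry no information about the
    features, so the true error of any classifier is 0.5.
    """
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            data.append((x, label))
    return data


def centroids(data):
    """Return the per-class mean vector for labels 0 and 1."""
    result = {}
    for label in (0, 1):
        rows = [x for x, y in data if y == label]
        result[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return result


def predict(cents, x):
    """Assign x to the class with the nearer centroid (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return 0 if sqdist(cents[0]) <= sqdist(cents[1]) else 1


def resubstitution_error(data):
    """Design the classifier on all data, then count errors on the same data."""
    cents = centroids(data)
    wrong = sum(predict(cents, x) != y for x, y in data)
    return wrong / len(data)


def leave_one_out_error(data):
    """Hold out each point in turn, design on the rest, test on the held-out point."""
    wrong = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        wrong += predict(centroids(train), x) != y
    return wrong / len(data)


rng = random.Random(0)  # fixed seed so the run is repeatable
sample = make_null_sample(n_per_class=10, n_features=50, rng=rng)
resub = resubstitution_error(sample)
loo = leave_one_out_error(sample)
print(f"resubstitution error: {resub:.2f}")
print(f"leave-one-out error:  {loo:.2f}")
```

Because every training point pulls its own class centroid toward itself, the resubstitution estimate in this setting tends to fall well below the true error of 0.5, and the gap widens as more noise features are added relative to the sample size. The leave-one-out estimate does not share that optimism, although on samples this small it can vary considerably from one sample to the next.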