Okun Oleg, Priisalu Helen
University of Oulu, Department of Electrical and Information Engineering, P.O. Box 4500, Oulu 90014, Finland.
Artif Intell Med. 2009 Feb-Mar;45(2-3):151-62. doi: 10.1016/j.artmed.2008.08.004. Epub 2008 Sep 14.
We explore the link between dataset complexity, which determines how difficult a dataset is to classify, and classification performance, measured by the low-variance, low-bias bolstered resubstitution error of k-nearest neighbor classifiers.
Gene expression-based cancer classification serves as the task in this study, and six gene expression datasets covering different types of cancer constitute the test data.
Through extensive simulation coupled with the copula method for analyzing association in bivariate data, we show that dataset complexity and bolstered resubstitution error are statistically dependent. Based on this result, we propose a new scheme for generating ensembles of classifiers that selects low-complexity feature subsets for the ensemble members; according to the dependence relation found, such subsets yield accurate members.
Experiments with six gene expression datasets demonstrate that our ensemble generation scheme, based on the dependence between dataset complexity and classification error, outperforms both the single best classifier in the ensemble and the traditional ensemble construction scheme, which ignores dataset complexity.
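The ensemble scheme summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the complexity measure, so the code assumes a Fisher-discriminant-ratio proxy, draws candidate feature subsets at random, uses synthetic two-class data in place of gene expression data, and combines k-nearest neighbor members by majority vote. All function names and parameters are hypothetical.

```python
import numpy as np

def fisher_ratio_complexity(X, y):
    """Assumed complexity proxy: 1 / (1 + max Fisher ratio); lower means easier."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12  # guard against zero variance
    return 1.0 / (1.0 + np.max(num / den))

def knn_predict(X_train, y_train, X_test, k=3):
    """Plain k-NN for binary labels in {0, 1}; odd k avoids ties."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

def select_low_complexity_subsets(X, y, n_candidates=50, subset_size=5,
                                  n_members=5, seed=0):
    """Draw random feature subsets and keep the n_members least complex ones."""
    rng = np.random.default_rng(seed)
    candidates = [rng.choice(X.shape[1], size=subset_size, replace=False)
                  for _ in range(n_candidates)]
    candidates.sort(key=lambda idx: fisher_ratio_complexity(X[:, idx], y))
    return candidates[:n_members]

def ensemble_predict(subsets, X_train, y_train, X_test, k=3):
    """Majority vote over k-NN members, each restricted to its own feature subset."""
    votes = np.stack([knn_predict(X_train[:, s], y_train, X_test[:, s], k)
                      for s in subsets])
    return (votes.mean(axis=0) > 0.5).astype(int)

# Synthetic stand-in for expression data: 60 samples, 40 features,
# of which only the first three separate the two classes.
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 40))
X[y == 1, :3] += 3.0

subsets = select_low_complexity_subsets(X, y)
pred = ensemble_predict(subsets, X, y, X)
c_informative = fisher_ratio_complexity(X[:, :3], y)
c_noise = fisher_ratio_complexity(X[:, 10:13], y)
```

In this sketch, predictions are made on the training samples only to show the mechanics; the paper evaluates members with bolstered resubstitution error, which is not reproduced here.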