Choudhary Ashish, Brun Marcel, Hua Jianping, Lowey James, Suh Ed, Dougherty Edward R
Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.
Bioinformatics. 2006 Apr 1;22(7):837-42. doi: 10.1093/bioinformatics/btl008. Epub 2006 Jan 20.
Given a large set of potential features, such as the set of all gene-expression values from a microarray, it is necessary to find a small subset with which to classify. The task of finding an optimal feature set of a given size is inherently combinatorial because, to assure optimality, all feature sets of that size must be checked. Thus, numerous suboptimal feature-selection algorithms have been proposed. There are strong impediments to evaluating feature-selection algorithms using real data when data are limited, a common situation in genetic classification. The difficulty is threefold. First, there are no class-conditional distributions from which to draw data points, only a single small labeled sample. Second, there are no test data with which to estimate the feature-set errors, and one must depend on a training-data-based error estimator. Finally, there is no optimal feature set with which to compare the feature sets found by the algorithms.
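To make the combinatorial cost concrete, the short Python sketch below counts how many feature sets of a given size would have to be checked. The figure of 1000 candidate features is an assumption chosen only for illustration; it is not taken from the paper.

```python
from math import comb


def count_feature_sets(n_features: int, subset_size: int) -> int:
    """Number of distinct feature sets of the given size (n choose k)."""
    return comb(n_features, subset_size)


# Hypothetical setting: 1000 candidate features (illustrative, not from the paper).
for size in range(1, 6):
    print(f"size {size}: {count_feature_sets(1000, size)} feature sets")
```

The count grows so quickly with subset size that an exhaustive check soon becomes impractical on a single machine, which is why suboptimal search algorithms are used in practice.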
This paper describes a genetic test bed for the evaluation of feature-selection algorithms. It begins with a large biological feature-label dataset that is used as an empirical distribution and, using massively parallel computation, finds the top feature sets of various sizes based on a given sample size and classification rule. The user can draw random samples from the data, apply a proposed algorithm, and evaluate the proficiency of that algorithm via three different measures (code provided). A key feature of the test bed is that, once a dataset is input, a single command creates the entire test bed for that dataset. The dataset used for the first version of the test bed comes from a microarray-based classification study that analyzes a large number of microarrays, prepared with RNA from breast tumor samples from each of 295 patients.
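The workflow described above can be sketched on a toy scale. The Python example below is only an illustration under stated assumptions, not the test bed's actual code: the data are synthetic, the classification rule is a nearest-mean classifier, the selection algorithm is a simple univariate ranking, and the "excess error" comparison is a hypothetical measure that is not claimed to be one of the paper's three. All function names are invented for this sketch.

```python
import random
from itertools import combinations


def make_population(n_points=400, n_features=6, seed=0):
    """Synthetic stand-in for the large labeled dataset (assumption, not real data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        label = rng.randint(0, 1)
        # Only features 0 and 1 carry class information; the rest are noise.
        x = [rng.gauss(1.5 * label if j < 2 else 0.0, 1.0) for j in range(n_features)]
        data.append((x, label))
    return data


def draw_sample(population, size, rng):
    """Draw a random sample without replacement, redrawing until both classes appear."""
    while True:
        sample = rng.sample(population, size)
        if len({y for _, y in sample}) == 2:
            return sample


def class_means(sample, feats):
    """Per-class mean of the chosen features."""
    means = {}
    for c in (0, 1):
        rows = [x for x, y in sample if y == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    return means


def error_on_population(sample, population, feats):
    """Train a nearest-mean classifier on the sample; score it on the full population."""
    means = class_means(sample, feats)
    wrong = 0
    for x, y in population:
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(feats, means[c])) for c in (0, 1)}
        pred = 0 if dist[0] <= dist[1] else 1
        wrong += pred != y
    return wrong / len(population)


def rank_feature_sets(sample, population, k):
    """Exhaustively score every feature set of size k, best (lowest error) first."""
    n_features = len(population[0][0])
    scored = [(error_on_population(sample, population, f), f)
              for f in combinations(range(n_features), k)]
    return sorted(scored)


def univariate_selection(sample, k):
    """Hypothetical algorithm under test: keep the k features with the largest mean gap."""
    n_features = len(sample[0][0])
    m = class_means(sample, range(n_features))
    gaps = [abs(m[1][j] - m[0][j]) for j in range(n_features)]
    top = sorted(range(n_features), key=lambda j: -gaps[j])[:k]
    return tuple(sorted(top))


population = make_population()
rng = random.Random(1)
sample = draw_sample(population, 30, rng)

ranking = rank_feature_sets(sample, population, 2)
best_error, best_set = ranking[0]
chosen = univariate_selection(sample, 2)
chosen_error = error_on_population(sample, population, chosen)

print("best set:", best_set, "error:", best_error)
print("chosen set:", chosen, "error:", chosen_error)
print("excess error of chosen set:", chosen_error - best_error)
```

Because the full dataset plays the role of the distribution, the error of any feature set can be measured directly on it, and the exhaustively ranked list supplies the reference against which a proposed algorithm's choice is compared. With realistic feature counts the exhaustive ranking is the expensive step, which is where the parallel computation mentioned above comes in.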
The software and supplementary material are available at http://public.tgen.org/tgen-cb/support/testbed/