Ooi Chia Huey, Chetty Madhu, Teng Shyh Wei
Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia.
BMC Bioinformatics. 2006 Jun 23;7:320. doi: 10.1186/1471-2105-7-320.
Because a typical microarray dataset contains a large number of genes, feature selection is expected to play an important role in reducing noise and computational cost in gene expression-based tissue classification while also improving accuracy. Surprisingly, this does not appear to hold for all multiclass microarray datasets. The reason is that many feature selection techniques applied to microarray datasets are either rank-based, and hence ignore correlations between genes, or wrapper-based, which incurs high computational cost and often yields difficult-to-reproduce results. In studies that do consider correlations between genes, attempts to establish the merit of the proposed techniques are hampered by insufficiently rigorous evaluation procedures, which result in overly optimistic estimates of accuracy.
We present two realistically evaluated correlation-based feature selection techniques that incorporate, in addition to the two existing criteria for forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP acts as a parameter that sets the balance between relevance and redundancy, giving our techniques the novel ability to prioritize the optimization of relevance over redundancy, or vice versa. This ability proves useful in achieving optimal classification accuracy with reasonably small predictor sets on nine well-known multiclass microarray datasets.
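The role of DDP as a balancing parameter can be illustrated with a small sketch. The abstract does not state the scoring formula, so everything below is an assumption made for illustration: the predictor set is scored as relevance^alpha * antiredundancy^(1 - alpha), relevance is taken as a between-class to within-class sum-of-squares ratio, antiredundancy as the mean of 1 - |Pearson correlation| over gene pairs, and genes are added greedily. The names `relevance`, `pearson`, `antiredundancy`, `select_genes`, and `alpha`, as well as the toy data, are hypothetical.

```python
from statistics import mean


def relevance(values, labels):
    """Between-class / within-class sum of squares for one gene (assumed measure)."""
    overall = mean(values)
    bss = wss = 0.0
    for c in set(labels):
        group = [v for v, l in zip(values, labels) if l == c]
        centre = mean(group)
        bss += len(group) * (centre - overall) ** 2
        wss += sum((v - centre) ** 2 for v in group)
    return bss / (wss + 1e-12)  # small constant avoids division by zero


def pearson(x, y):
    """Pearson correlation between two equally long lists of expression values."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def antiredundancy(names, genes):
    """Mean of 1 - |r| over all gene pairs in the candidate set (assumed measure)."""
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return mean([1 - abs(pearson(genes[a], genes[b])) for a, b in pairs])


def select_genes(genes, labels, size, alpha):
    """Greedy forward selection; alpha in (0, 1] plays the role of the DDP.

    alpha = 1 scores candidates on relevance alone; smaller alpha gives more
    weight to antiredundancy.
    """
    rel = {g: relevance(v, labels) for g, v in genes.items()}
    chosen = [max(rel, key=rel.get)]  # start from the most relevant gene
    while len(chosen) < size:
        best, best_score = None, -1.0
        for g in genes:
            if g in chosen:
                continue
            cand = chosen + [g]
            v = mean([rel[m] for m in cand])   # set relevance
            u = antiredundancy(cand, genes)    # set antiredundancy
            score = (v ** alpha) * (u ** (1 - alpha))
            if score > best_score:
                best, best_score = g, score
        chosen.append(best)
    return chosen


# Toy data: three classes, two samples each. g1 and g2 are near-copies of each
# other and strongly class-related; g3 is weaker but almost uncorrelated with them.
labels = ["a", "a", "b", "b", "c", "c"]
genes = {
    "g1": [1.0, 1.2, 3.0, 3.1, 5.0, 5.2],
    "g2": [1.1, 1.2, 3.1, 3.1, 5.1, 5.2],
    "g3": [2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
}

relevance_first = select_genes(genes, labels, size=2, alpha=1.0)
balanced = select_genes(genes, labels, size=2, alpha=0.5)
```

On this toy data, `alpha=1.0` pairs the two near-duplicate genes because only relevance counts, whereas `alpha=0.5` replaces the duplicate with the less relevant but uncorrelated gene. In practice the value of the parameter would be tuned per dataset, and the selection step would be repeated inside each training split to avoid optimistic accuracy estimates.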
For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.