Characteristics of predictor sets found using differential prioritization.

Author information

Ooi Chia Huey, Chetty Madhu, Teng Shyh Wei

Affiliations

Gippsland School of Information Technology, Monash University, Churchill, VIC 3842, Australia.

Publication information

Algorithms Mol Biol. 2007 Jun 4;2:7. doi: 10.1186/1748-7188-2-7.

DOI:10.1186/1748-7188-2-7

PMID:17547742

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC1920513/

Abstract

BACKGROUND

Feature selection plays an important role in classification problems involving high-dimensional datasets such as microarray datasets. For filter-based feature selection, two well-known criteria used in forming predictor sets are relevance and redundancy. However, a third criterion is at least as important as the other two in affecting the efficacy of the resulting predictor sets: the degree of differential prioritization (DDP), whose value sets the relative emphasis placed on relevance and on redundancy. Previous empirical work on publicly available microarray datasets has confirmed the effectiveness of the DDP in molecular classification. We now propose to establish the fundamental strengths and merits of the DDP-based feature selection technique through a simulation study. The study involves rigorous analyses of the characteristics of predictor sets found using different values of the DDP from toy datasets designed to mimic real-life microarray datasets.
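To make the role of the DDP concrete, the following is a minimal sketch of a greedy filter that trades relevance against redundancy through a single parameter. It is an illustration under stated assumptions, not the authors' exact method: the score form `relevance ** alpha * antiredundancy ** (1 - alpha)`, the between-class to within-class sum-of-squares ratio used as relevance, the mean absolute Pearson correlation used as redundancy, and the names `relevance`, `ddp_select`, and `alpha` are all hypothetical choices made for this example.

```python
import numpy as np


def relevance(X, y):
    """Per-feature ratio of between-class to within-class sum of squares.

    This is an assumed relevance measure chosen for illustration.
    """
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * (class_mean - overall_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero


def ddp_select(X, y, n_select, alpha):
    """Greedy forward selection with an assumed DDP-style score.

    alpha in [0, 1] weights relevance; (1 - alpha) weights antiredundancy.
    """
    n_select = min(n_select, X.shape[1])
    rel = relevance(X, y)
    abs_corr = np.abs(np.corrcoef(X, rowvar=False))
    # Start from the single most relevant feature.
    selected = [int(np.argmax(rel))]
    while len(selected) < n_select:
        best_feature, best_score = None, -np.inf
        for j in range(X.shape[1]):
            if j in selected:
                continue
            candidate = selected + [j]
            k = len(candidate)
            # Relevance of the candidate set: mean per-feature relevance.
            v = rel[candidate].mean()
            # Antiredundancy: one minus the mean absolute pairwise correlation.
            block = abs_corr[np.ix_(candidate, candidate)]
            mean_abs_corr = (block.sum() - np.trace(block)) / (k * (k - 1))
            u = max(1.0 - mean_abs_corr, 0.0)
            score = v ** alpha * u ** (1.0 - alpha)
            if score > best_score:
                best_feature, best_score = j, score
        selected.append(best_feature)
    return selected
```

Under this assumed form, `alpha = 1.0` reduces to picking features by relevance alone, `alpha = 0.5` weights the two criteria equally, and other values shift the emphasis one way or the other, which is the kind of adjustment the abstract argues should depend on the dataset.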

RESULTS

A simulation study is carried out on toy datasets using analytical measures such as the distance between classes before and after transformation using principal component analysis. From these analyses, the necessity of adjusting the differential prioritization based on the dataset of interest is established. This conclusion is supported by comparisons against both simplistic rank-based selection and state-of-the-art equal-priorities scoring methods, which demonstrate the superiority of the DDP-based feature selection technique. Reapplying similar analyses to real-life multiclass microarray datasets provides further confirmation of our findings and of the significance of the DDP for practical applications.
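One way such a class-distance measure could be computed is sketched below. The abstract does not define the exact distance, so this example assumes the mean pairwise Euclidean distance between class centroids, evaluated on the selected features and again after projection onto the leading principal components. The function names `centroid_distance`, `pca_project`, and `distance_before_after` are hypothetical.

```python
import numpy as np


def centroid_distance(X, y):
    """Mean pairwise Euclidean distance between class centroids (assumed measure)."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(classes), k=1)  # each class pair counted once
    return dists[upper].mean()


def pca_project(X, n_components):
    """Project mean-centered data onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def distance_before_after(X, y, n_components):
    """Return the class distance in the original space and in the PCA space."""
    before = centroid_distance(X, y)
    after = centroid_distance(pca_project(X, n_components), y)
    return before, after
```

Because the projection is orthogonal, the distance after PCA can never exceed the distance before it; how much of the class separation survives in the first few components is what makes the comparison informative for a given predictor set.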

CONCLUSION

The findings are based on analytical evaluations rather than empirical evaluations involving classifiers. They thus provide a further basis for the usefulness of the DDP and validate the need for unequal priorities on relevance and redundancy during feature selection for microarray datasets, especially highly multiclass datasets.
