National ICT Australia, Victoria Research Laboratory, Parkville, Australia.
Bioinformatics. 2012 Jan 15;28(2):151-9. doi: 10.1093/bioinformatics/btr644. Epub 2011 Nov 21.
Feature selection is a key concept in machine learning for microarray datasets, where the number of features (represented by probesets) is typically several orders of magnitude larger than the number of available samples. Computational tractability is a key challenge for feature selection algorithms when handling very high-dimensional datasets of more than a hundred thousand features, such as those produced on single nucleotide polymorphism microarrays. In this article, we present a novel feature set reduction approach that enables scalable feature selection on datasets with hundreds of thousands of features and beyond. Our approach enables more efficient handling of higher-resolution datasets to achieve better disease subtype classification of samples for potentially more accurate diagnosis and prognosis, which allows clinicians to make more informed decisions with regard to patient treatment options.
We applied our feature set reduction approach to several publicly available cancer single nucleotide polymorphism (SNP) array datasets and evaluated its performance in terms of its multiclass predictive classification accuracy over different cancer subtypes, its speedup in execution, and its scalability with respect to sample size and array resolution. Feature Set Reduction (FSR) was able to reduce the dimensions of an SNP array dataset by more than two orders of magnitude while achieving at least equal, and in most cases superior, predictive classification performance compared with that achieved on features selected by existing feature selection methods alone. An examination of the biological relevance of frequently selected features from FSR-reduced feature sets revealed strong enrichment in association with cancer.
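To make the scale of a reduction "by more than two orders of magnitude" concrete, the following is a minimal sketch of a pre-selection reduction step on synthetic data. It assumes a simple variance ranking as a stand-in filter; this is not the FSR algorithm, which the abstract does not describe. The function names `reduce_feature_set` and `reduction_factor`, the synthetic data, and all parameter values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def reduce_feature_set(samples, keep):
    """Return the sorted indices of the `keep` highest-variance features.

    Variance ranking is a stand-in filter used for illustration only;
    it is NOT the FSR method from the paper.
    """
    n_features = len(samples[0])
    variances = [
        statistics.pvariance([row[j] for row in samples])
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:keep])


def reduction_factor(n_before, n_after):
    """Ratio of the original feature count to the reduced feature count."""
    return n_before / n_after


# Synthetic stand-in for an array dataset: few samples, many features.
rng = random.Random(0)
n_samples, n_features, n_informative = 20, 2000, 10

samples = []
for i in range(n_samples):
    # Low-variance noise on every feature.
    row = [rng.gauss(0.0, 0.1) for _ in range(n_features)]
    # The first `n_informative` features are shifted by class (even/odd sample).
    for j in range(n_informative):
        row[j] += 2.0 if i % 2 == 0 else -2.0
    samples.append(row)

kept = reduce_feature_set(samples, keep=n_informative)
factor = reduction_factor(n_features, len(kept))
```

In this toy setting the reduced set is then small enough to hand to any conventional feature selection method, which is the general shape of the two-stage workflow the abstract describes; a reduction factor above 100 corresponds to more than two orders of magnitude.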
FSR was implemented in MATLAB R2010b and is available at http://ww2.cs.mu.oz.au/~gwong/FSR.