Department of Biostatistics, School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran.
Bioinformatics and Computational Biology Research Center, Shiraz University of Medical Sciences, Shiraz, Iran.
Biomed Res Int. 2017;2017:7560807. doi: 10.1155/2017/7560807. Epub 2017 Dec 11.
K nearest neighbors (KNN) is known as one of the simplest nonparametric classifiers, but in high dimensional settings the accuracy of KNN is degraded by nuisance features. In this study, we proposed the K important neighbors (KIN) as a novel approach for binary classification in high dimensional problems. To avoid the curse of dimensionality, we implemented smoothly clipped absolute deviation (SCAD) logistic regression at the initial stage and accounted for the importance of each feature in the construction of the dissimilarity measure by imposing feature contributions, expressed as a function of the SCAD coefficients, on the Euclidean distance. This hybrid dissimilarity measure, which combines information from both features and distances, enjoys all of the good properties of SCAD penalized regression and KNN simultaneously. In comparison to KNN, simulation studies showed that KIN performs well in terms of both accuracy and dimension reduction. The proposed approach was found to eliminate nearly all of the noninformative features, owing to the oracle property of the SCAD penalized regression used in constructing the dissimilarity measure. In very sparse settings, KIN also outperforms support vector machine (SVM) and random forest (RF), which are regarded as the best classifiers.
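The two-stage idea described above can be sketched in a few lines of Python. The sketch below is illustrative only and makes two simplifying assumptions: an L1-penalized logistic regression stands in for SCAD (which is not available in scikit-learn), and the absolute coefficients are used directly as feature weights; the paper's exact weighting function may differ. Rescaling the columns by these weights before running ordinary KNN corresponds to a weighted Euclidean distance (with squared weights), so features with zero coefficients drop out of the distance entirely.

```python
# Illustrative KIN-style sketch (assumptions: L1 penalty in place of SCAD,
# |coefficients| as feature weights; not the authors' implementation).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# High dimensional, very sparse binary problem: few informative features.
X, y = make_classification(n_samples=300, n_features=500, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Stage 1: sparse penalized logistic regression screens out nuisance features.
penalized = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
penalized.fit(X_train, y_train)
weights = np.abs(penalized.coef_.ravel())  # per-feature importances

# Stage 2: KNN on feature-weighted coordinates, i.e. a weighted Euclidean
# dissimilarity in which noninformative (zero-coefficient) features vanish.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train * weights, y_train)
accuracy = knn.score(X_test * weights, y_test)
print(f"kept {np.count_nonzero(weights)} of {X.shape[1]} features, "
      f"test accuracy = {accuracy:.2f}")
```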