Choi Byeong Yeob, Bair Eric, Lee Jae Won
Department of Epidemiology and Biostatistics, University of Texas Health Science Center, San Antonio, TX, United States of America.
Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States of America.
PLoS One. 2017 Feb 15;12(2):e0171068. doi: 10.1371/journal.pone.0171068. eCollection 2017.
Nearest shrunken centroids (NSC) is a popular classification method for microarray data. NSC calculates centroids for each class and "shrinks" the centroids toward 0 using soft thresholding. Future observations are then assigned to the class with the minimum distance between the observation and the (shrunken) centroid. Under certain conditions, the soft shrinkage used by NSC is equivalent to a LASSO penalty. However, this penalty can produce biased estimates when the true coefficients are large. In addition, NSC ignores the fact that multiple measures of the same gene are likely to be related to one another. We considered several alternative genewise shrinkage methods to address these shortcomings of NSC. Three alternative penalties were considered: the smoothly clipped absolute deviation (SCAD), the adaptive LASSO (ADA), and the minimax concave penalty (MCP). We also showed that NSC can be performed in a genewise manner. Classification methods were derived for each alternative shrinkage method or alternative genewise penalty, and the performance of each new classification method was compared with that of conventional NSC on several simulated and real microarray data sets. Moreover, we applied the geometric mean approach to the alternative penalty functions. In general, the alternative (genewise) penalties required fewer genes than NSC. The geometric mean of the class-specific prediction accuracies was improved, as was the overall predictive accuracy in some cases. These results indicate that these alternative penalties should be considered when using NSC.
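The shrinkage step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the textbook univariate thresholding rules for the LASSO (soft thresholding), MCP, and SCAD penalties, and a deliberately simplified NSC that shrinks each class centroid's offset from the overall centroid, without the within-class standardization or class priors that a full NSC would use. The adaptive LASSO and the genewise variants are not shown. All function names and the default tuning values (`gamma=3.0`, `a=3.7`) are hypothetical choices made for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO rule: move every value toward 0 by lam, clipping at 0."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def mcp_threshold(z, lam, gamma=3.0):
    """Textbook MCP rule (gamma > 1): values beyond gamma*lam are left unshrunk."""
    z = np.asarray(z, dtype=float)
    inner = soft_threshold(z, lam) / (1.0 - 1.0 / gamma)
    return np.where(np.abs(z) <= gamma * lam, inner, z)

def scad_threshold(z, lam, a=3.7):
    """Textbook SCAD rule (a > 2): soft near 0, unshrunk beyond a*lam."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    middle = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(absz <= 2.0 * lam, soft_threshold(z, lam),
                    np.where(absz <= a * lam, middle, z))

def fit_shrunken_centroids(X, y, lam, shrink=soft_threshold):
    """Simplified NSC: shrink each class centroid's offset from the overall
    centroid (assumption: no standardization, no class priors)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    overall = X.mean(axis=0)
    offsets = np.stack([X[y == k].mean(axis=0) - overall for k in classes])
    return classes, overall + shrink(offsets, lam)

def predict(X, classes, centroids):
    """Assign each row to the class with the nearest shrunken centroid."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

def geometric_mean_accuracy(y_true, y_pred):
    """Geometric mean of the class-specific accuracies (per-class recall)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == k] == k) for k in np.unique(y_true)]
    return float(np.prod(recalls) ** (1.0 / len(recalls)))

# Toy data: feature 0 separates the classes, feature 1 is nearly constant.
X_train = np.array([[0.0, 1.0], [0.0, 1.2], [4.0, 1.1], [4.0, 0.9]])
y_train = np.array([0, 0, 1, 1])
classes, centroids = fit_shrunken_centroids(X_train, y_train, lam=0.5)
labels = predict(np.array([[0.2, 1.0], [3.9, 1.0]]), classes, centroids)
```

In this toy example the offsets on the second feature fall below the threshold and are shrunk to exactly zero, so that feature no longer contributes to the classification, which is how shrinkage selects genes. Passing `shrink=mcp_threshold` or `shrink=scad_threshold` leaves large offsets unshrunk, which is the property that motivates these penalties as a remedy for the bias of soft thresholding on large coefficients.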