Park Mee Young, Hastie Trevor, Tibshirani Robert
Google Inc, Mountain View, CA 94043, USA.
Biostatistics. 2007 Apr;8(2):212-27. doi: 10.1093/biostatistics/kxl002. Epub 2006 May 11.
Although averaging is a simple technique, it plays an important role in reducing variance. We exploit this property in regression on DNA microarray data, which pose the challenge of having far more features than samples. In this paper, we introduce a two-step procedure that combines (1) hierarchical clustering and (2) the lasso. By averaging the genes within the clusters obtained from hierarchical clustering, we define supergenes and use them to fit regression models, yielding models that are both concisely interpretable and accurate. Our methods are supported with theoretical justifications and demonstrated on simulated and real data sets.
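The two-step procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the distance metric, linkage, number of clusters, or penalty strength, so the correlation distance, average linkage, `n_clusters`, and `alpha` below are assumptions chosen only for illustration. The function name `supergene_lasso` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.linear_model import Lasso


def supergene_lasso(X, y, n_clusters, alpha):
    """Cluster the columns (genes) of X, average within clusters, fit a lasso."""
    # Step 1: hierarchical clustering of genes; each gene is a column of X.
    # Correlation distance and average linkage are assumed, not from the paper.
    dist = pdist(X.T, metric="correlation")
    tree = linkage(dist, method="average")
    # Cut the tree into at most n_clusters flat clusters.
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")

    # Average the genes in each cluster to form one supergene per cluster.
    cluster_ids = np.unique(labels)
    supergenes = np.column_stack(
        [X[:, labels == k].mean(axis=1) for k in cluster_ids]
    )

    # Step 2: lasso regression of the response on the supergenes.
    model = Lasso(alpha=alpha).fit(supergenes, y)
    return labels, supergenes, model


# Synthetic data (assumed): 40 samples, 200 genes in 10 correlated groups,
# so there are far more features than samples.
rng = np.random.default_rng(0)
n_samples, n_genes, n_groups = 40, 200, 10
latent = rng.normal(size=(n_samples, n_groups))
membership = np.repeat(np.arange(n_groups), n_genes // n_groups)
X = latent[:, membership] + 0.5 * rng.normal(size=(n_samples, n_genes))
y = 2.0 * latent[:, 0] - 1.5 * latent[:, 1] + 0.1 * rng.normal(size=n_samples)

labels, supergenes, model = supergene_lasso(X, y, n_clusters=10, alpha=0.1)
print("supergene matrix shape:", supergenes.shape)
print("nonzero coefficients:", int(np.count_nonzero(model.coef_)))
```

Averaging before the regression shrinks the design matrix from one column per gene to one column per cluster, which is where the variance reduction mentioned in the abstract enters. In practice the number of clusters and the penalty would be tuned, for example by cross-validation.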