Sun Wei, Li Lexin
Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599, USA.
Biometrics. 2012 Mar;68(1):12-22. doi: 10.1111/j.1541-0420.2011.01650.x. Epub 2011 Aug 12.
Despite the recent flourish of proposals on variable selection, genome-wide multiple loci mapping remains challenging. The majority of existing variable selection methods impose a model, often the homoscedastic linear model, prior to selection. However, the true association between the phenotypic trait and the genetic markers is rarely known a priori, and the presence of epistatic interactions makes the association more complex than a linear relation. Model-free variable selection offers a useful alternative in this context, but the number of markers p often far exceeds the number of experimental units n, which renders all existing model-free solutions that require n > p inapplicable. In this article, we examine a number of model-free variable selection methods for small-n-large-p regressions in the context of genome-wide multiple loci mapping. We propose and advocate a multivariate group-wise adaptive penalization solution, which requires no model prespecification and thus works for complex trait-marker associations, and which handles one variable at a time so that it works for n < p. The effectiveness of the new method is demonstrated through both intensive simulations and a comprehensive real data analysis across 6100 gene expression traits.