Nuffield College, Oxford OX1 1NF, United Kingdom;
Department of Mathematics, Imperial College London, London SW7 2AZ, United Kingdom
Proc Natl Acad Sci U S A. 2017 Aug 8;114(32):8592-8595. doi: 10.1073/pnas.1703764114. Epub 2017 Jul 24.
Data with a relatively small number of study individuals and a very large number of potential explanatory features arise particularly, but by no means only, in genomics. A powerful method of analysis, the lasso [Tibshirani R (1996) J R Stat Soc B 58:267-288], takes account of an assumed sparsity of effects, that is, the assumption that most of the features are nugatory. Standard criteria for model fitting, such as the method of least squares, are modified by imposing a penalty for each explanatory variable used. The result is a single model, which leaves open the possibility that other sparse choices of explanatory features fit virtually equally well. The method suggested in this paper aims to specify simple models that are essentially equally effective, leaving detailed interpretation to the specifics of the particular study. The method hinges on the ability to carry out, as a first step, a very large number of separate analyses, so that each explanatory feature is assessed in combination with many other such features. Further stages allow the assessment of more complex patterns such as nonlinear and interactive dependences. The method has formal similarities to the so-called partially balanced incomplete block designs introduced 80 years ago [Yates F (1936) J Agric Sci 26:424-455] for the study of large-scale plant breeding trials. The emphasis in this paper is strongly on exploratory analysis; the more formal statistical properties obtained under idealized assumptions will be reported separately.
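The idea of many separate small analyses can be illustrated with a short sketch. The abstract does not specify how features are grouped, so everything below is an assumption made for illustration: the features are laid out in a square array, the response is regressed by least squares on each row and on each column (so every feature appears in exactly two fits, each time alongside different companions), and a feature is retained if it ranks among the `keep` largest absolute t statistics in both of its fits. The names `ols_t_stats`, `square_screen`, `side` and `keep`, and the simulated data, are hypothetical and not taken from the paper.

```python
import numpy as np


def ols_t_stats(X, y):
    """Least-squares fit with an intercept; return t statistics for the columns of X."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares coefficients
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - k - 1)          # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)         # covariance of the estimates
    return beta[1:] / np.sqrt(np.diag(cov)[1:])   # drop the intercept


def square_screen(X, y, side, keep=2):
    """Assumed screening rule: fit each row and column of a side x side feature
    layout separately and retain features ranked in the top `keep` both times."""
    n, p = X.shape
    if p != side * side:
        raise ValueError("number of features must equal side * side")
    grid = np.arange(p).reshape(side, side)
    counts = np.zeros(p, dtype=int)
    for block in list(grid) + list(grid.T):       # all rows, then all columns
        t = ols_t_stats(X[:, block], y)
        top = block[np.argsort(-np.abs(t))[:keep]]
        counts[top] += 1
    return set(np.flatnonzero(counts == 2).tolist())


# Simulated example (hypothetical data): 60 individuals, 100 features,
# of which only features 3 and 57 have a real effect.
rng = np.random.default_rng(0)
n, side = 60, 10
X = rng.standard_normal((n, side * side))
y = 3.0 * X[:, 3] + 3.0 * X[:, 57] + rng.standard_normal(n)

selected = square_screen(X, y, side=side, keep=2)
print(sorted(selected))
```

Each fit here involves only 10 features, so ordinary least squares is usable even though the full problem has more features than individuals. The retained set will typically include some null features alongside the real ones; in this sketch it is only a first-stage shortlist for further exploratory analysis, not a final model.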