Martini Johannes W R, Rosales Francisco, Ha Ngoc-Thuy, Heise Johannes, Wimmer Valentin, Kneib Thomas
KWS SAAT SE, Einbeck, Germany
Universidad del Pacífico, Academic Department of Finance, Lima, Peru.
G3 (Bethesda). 2019 Apr 9;9(4):1117-1129. doi: 10.1534/g3.118.200961.
Mixed models can be considered a type of penalized regression and are everyday tools in statistical genetics. The standard mixed model for whole genome regression (WGR) is ridge regression best linear unbiased prediction (RRBLUP), which is based on an additive marker effect model. Many publications have extended the additive WGR approach by incorporating interactions between loci or between genes and environment. In this context of penalized regressions with interactions, it has been reported that translating the marker coding, for instance from -1,0,1 to 0,1,2, has an impact on the prediction of genetic values and interaction effects. In this work, we identify the reason for the relevance of variable coding in the general context of penalized polynomial regression. We show that in many cases, predictions of the genetic values are not invariant to translations of the variable coding, with an exception when only the sizes of the coefficients of monomials of highest total degree are penalized. The invariance of RRBLUP can be considered a special case of this setting, with a polynomial of total degree 1, penalizing additive effects (total degree 1) but not the fixed effect (total degree 0). The extended RRBLUP (eRRBLUP), which includes interactions, is not invariant to translations because it penalizes not only interactions (total degree 2) but also additive effects (total degree 1). This observation implies that translation-invariance can be maintained in a pair-wise epistatic WGR if only interaction effects are penalized, but not the additive effects. In this regard, approaches of pre-selecting loci may not only reduce computation time, but can also help to avoid the variable coding issue. To illustrate the practical relevance, we compare different regressions on a publicly available wheat data set. We show that for an eRRBLUP, the relevance of the marker coding for interaction effect estimates increases with the number of variables included in the model.
A biological interpretation of estimated interaction effects may therefore become more difficult. Consequently, when comparing reproducing kernel Hilbert space (RKHS) approaches to WGR approaches that model effects explicitly, the supposed advantage of an increased interpretability of the latter may not be real. Our theoretical results are generally valid for penalized regressions, for instance also for the least absolute shrinkage and selection operator (LASSO). Moreover, they apply to any type of interaction modeled by products of predictor variables in a penalized regression approach or by Hadamard products of covariance matrices in a mixed model.
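The invariance statements above can be checked numerically on a toy example. The following is a minimal sketch, not the authors' implementation: it uses simulated genotypes rather than the wheat data, a plain ridge-type penalty with an arbitrarily chosen penalty weight `lam`, and hypothetical helper names (`design_matrix`, `penalized_fit`, `shift_effect`). It compares fitted genetic values under the codings -1,0,1 and 0,1,2 for three penalization schemes.

```python
import itertools
import numpy as np


def design_matrix(markers):
    """Columns: intercept, additive marker terms, all pairwise products."""
    n, m = markers.shape
    pairs = itertools.combinations(range(m), 2)
    inter = np.column_stack([markers[:, i] * markers[:, j] for i, j in pairs])
    return np.column_stack([np.ones(n), markers, inter])


def penalized_fit(X, y, penalized, lam):
    """Fitted values for argmin ||y - X b||^2 + lam * sum of b_k^2 over penalized k."""
    D = np.diag(penalized.astype(float))
    beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
    return X @ beta


# Simulated toy data (assumption: not the data set used in the paper).
rng = np.random.default_rng(0)
n, m = 60, 3
M = rng.integers(-1, 2, size=(n, m)).astype(float)   # coding -1, 0, 1
M_shift = M + 1.0                                     # coding  0, 1, 2
y = (M @ np.array([1.0, -0.5, 0.3])
     + 0.8 * M[:, 0] * M[:, 1]
     + rng.normal(scale=0.5, size=n))

n_inter = m * (m - 1) // 2
is_add = np.array([False] + [True] * m + [False] * n_inter)
is_int = np.array([False] + [False] * m + [True] * n_inter)


def shift_effect(cols, penalized, lam=10.0):
    """Largest change in fitted values caused by translating the coding."""
    fit_a = penalized_fit(design_matrix(M)[:, cols], y, penalized[cols], lam)
    fit_b = penalized_fit(design_matrix(M_shift)[:, cols], y, penalized[cols], lam)
    return np.max(np.abs(fit_a - fit_b))


additive_cols = np.arange(1 + m)            # intercept + additive terms only
all_cols = np.arange(1 + m + n_inter)       # plus pairwise interactions

# Additive model, additive effects penalized, intercept unpenalized.
diff_additive = shift_effect(additive_cols, is_add)
# Epistatic model, additive and interaction effects both penalized.
diff_both = shift_effect(all_cols, is_add | is_int)
# Epistatic model, only the highest-degree (interaction) terms penalized.
diff_inter_only = shift_effect(all_cols, is_int)

print("additive model:             ", diff_additive)
print("penalize additive + inter.: ", diff_both)
print("penalize interactions only: ", diff_inter_only)
```

In this sketch the first and third differences should vanish up to floating-point error, while the second should not. This matches the abstract: the fit is translation-invariant only when the penalty is restricted to the coefficients of the highest total degree.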