Lost in Translation: On the Problem of Data Coding in Penalized Whole Genome Regression with Interactions.

Authors

Martini Johannes W R, Rosales Francisco, Ha Ngoc-Thuy, Heise Johannes, Wimmer Valentin, Kneib Thomas

Affiliations

KWS SAAT SE, Einbeck, Germany

Universidad del Pacífico, Academic Department of Finance, Lima, Peru.

Publication information

G3 (Bethesda). 2019 Apr 9;9(4):1117-1129. doi: 10.1534/g3.118.200961.

DOI:10.1534/g3.118.200961

PMID:30760541

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6469405/

Abstract

Mixed models can be considered as a type of penalized regression and are everyday tools in statistical genetics. The standard mixed model for whole genome regression (WGR) is ridge regression best linear unbiased prediction (RRBLUP), which is based on an additive marker effect model. Many publications have extended the additive WGR approach by incorporating interactions between loci or between genes and environment. In this context of penalized regressions with interactions, it has been reported that translating the coding of single nucleotide polymorphisms (for instance, from -1,0,1 to 0,1,2) has an impact on the prediction of genetic values and interaction effects. In this work, we identify the reason for the relevance of variable coding in the general context of penalized polynomial regression. We show that in many cases, predictions of the genetic values are not invariant to translations of the variable coding, with an exception when only the sizes of the coefficients of monomials of highest total degree are penalized. The invariance of RRBLUP can be considered as a special case of this setting, with a polynomial of total degree 1, penalizing additive effects (total degree 1) but not the fixed effect (total degree 0). The extended RRBLUP (eRRBLUP), which includes interactions, is not invariant to translations because it penalizes not only interactions (total degree 2) but also additive effects (total degree 1). This observation implies that translation-invariance can be maintained in a pair-wise epistatic WGR if only interaction effects are penalized, but not the additive effects. In this regard, approaches of pre-selecting loci may not only reduce computation time, but can also help to avoid the variable coding issue. To illustrate the practical relevance, we compare different regressions on a publicly available wheat data set. We show that for an eRRBLUP, the relevance of the marker coding for interaction effect estimates increases with the number of variables included in the model. A biological interpretation of estimated interaction effects may therefore become more difficult. Consequently, comparing reproducing kernel Hilbert space (RKHS) approaches to WGR approaches modeling effects explicitly, the supposed advantage of an increased interpretability of the latter may not be real. Our theoretical results are generally valid for penalized regressions, for instance also for the least absolute shrinkage and selection operator (LASSO). Moreover, they apply to any type of interaction modeled by products of predictor variables in a penalized regression approach or by Hadamard products of covariance matrices in a mixed model.
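The invariance statement in the abstract can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code or data: it simulates a small marker matrix, fits ridge-type regressions with a fixed penalty `lam` (a real RRBLUP would usually estimate the penalty from variance components), and compares fitted values under the -1,0,1 and 0,1,2 codings. The reason for the expected pattern is that (x_i + c)(x_j + c) = x_i x_j + c x_i + c x_j + c^2, so a translation leaves the interaction coefficients unchanged and shifts mass only into additive and intercept terms. All names (`design`, `fitted_values`, `max_coding_gap`), the sample size, and the simulated effects are hypothetical.

```python
import itertools
import numpy as np

def design(X, interactions):
    """Columns: intercept, markers, and optionally all pairwise marker products."""
    n, p = X.shape
    cols = [np.ones((n, 1)), X]
    if interactions:
        pairs = itertools.combinations(range(p), 2)
        cols.append(np.column_stack([X[:, i] * X[:, j] for i, j in pairs]))
    return np.hstack(cols)

def fitted_values(X, y, interactions, penalize_additive, lam=10.0):
    """Minimize ||y - Z b||^2 + lam * sum of squared penalized coefficients."""
    Z = design(X, interactions)
    p = X.shape[1]
    d = np.ones(Z.shape[1])
    d[0] = 0.0                      # the intercept (degree 0) is never penalized
    if not penalize_additive:
        d[1:p + 1] = 0.0            # leave additive effects (degree 1) unpenalized
    beta = np.linalg.solve(Z.T @ Z + lam * np.diag(d), Z.T @ y)
    return Z @ beta

def max_coding_gap(X, y, **kwargs):
    """Largest change in fitted values when the coding is shifted by +1."""
    fit_a = fitted_values(X, y, **kwargs)          # coding -1, 0, 1
    fit_b = fitted_values(X + 1.0, y, **kwargs)    # coding  0, 1, 2
    return np.max(np.abs(fit_a - fit_b))

# Simulated data (assumption): 60 lines, 4 markers, one true pairwise interaction.
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(60, 4)).astype(float)
y = (X @ np.array([1.0, -0.5, 0.3, 0.0])
     + 0.8 * X[:, 0] * X[:, 1]
     + rng.normal(scale=0.5, size=60))

gap_rrblup = max_coding_gap(X, y, interactions=False, penalize_additive=True)
gap_errblup = max_coding_gap(X, y, interactions=True, penalize_additive=True)
gap_top_only = max_coding_gap(X, y, interactions=True, penalize_additive=False)

print("additive model, additive effects penalized:", gap_rrblup)
print("epistatic model, additive + interaction penalized:", gap_errblup)
print("epistatic model, only interactions penalized:", gap_top_only)
```

In this setup the first and third gaps should be at the level of floating-point error, while the second should be clearly nonzero. Leaving the additive effects unpenalized requires the intercept and marker columns to have full column rank, which is one way to see why the abstract links this option to pre-selecting a limited number of loci.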
