Data-driven encoding for quantitative genetic trait prediction.

Publication information

BMC Bioinformatics. 2015;16 Suppl 1(Suppl 1):S10. doi: 10.1186/1471-2105-16-S1-S10. Epub 2015 Feb 18.

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC4571493/

Abstract

MOTIVATION

Given a set of biallelic molecular markers, such as SNPs, genotyped on a collection of plant, animal or human samples, the goal of quantitative genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Quantitative genetic trait prediction is usually formulated as a linear regression model, which requires a quantitative encoding of the genotypes: the three distinct genotype values, corresponding to one heterozygous and two homozygous genotypes, are usually coded as integers and manipulated algebraically in the model. Further, epistasis between multiple markers is modeled as the product of the markers' encoded values, and it is unclear whether the regression model remains effective under this treatment. In this work we investigate the effect of the encoding on the quantitative genetic trait prediction problem.
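To make the encoding question concrete, the following is a minimal sketch, not the authors' code. It assumes two commonly used integer codings (0/1/2 and -1/0/1) and hypothetical genotype labels `AA`, `Aa`, `aa`; the abstract does not say which integer codings the paper compares. It shows how the product term used for epistasis changes with the coding.

```python
# Two integer encodings for a biallelic marker with genotypes AA, Aa, aa.
# Both tables are illustrative assumptions, not taken from the paper.
ENCODINGS = {
    "0/1/2": {"AA": 0, "Aa": 1, "aa": 2},
    "-1/0/1": {"AA": -1, "Aa": 0, "aa": 1},
}

GENOTYPES = ["AA", "Aa", "aa"]


def interaction_term(g1, g2, table):
    """Epistasis feature for two markers: product of their encoded values."""
    return table[g1] * table[g2]


def interaction_grid(table):
    """Product term for all nine two-marker genotype combinations.

    Rows index the first marker's genotype, columns the second's,
    both in the order AA, Aa, aa.
    """
    return [[interaction_term(a, b, table) for b in GENOTYPES] for a in GENOTYPES]


for name, table in ENCODINGS.items():
    print(name, interaction_grid(table))
```

Under 0/1/2 the product vanishes whenever either marker is `AA`, while under -1/0/1 it vanishes whenever either marker is heterozygous. The two grids are not shifted copies of each other, so a model that penalizes or omits some terms may fit differently depending on the coding.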

RESULTS

We first show that different encodings lead to different prediction accuracies in many test cases. We then propose a data-driven encoding strategy in which genotypes are encoded according to their distribution in the phenotypes and each marker is allowed to have its own encoding. Our experiments show that this encoding strategy improves the performance of genetic trait prediction methods and that it is more helpful for oligogenic traits, whose values rely on a relatively small set of markers. To the best of our knowledge, this is the first paper to discuss the effect of encodings on the genetic trait prediction problem.
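One possible reading of a per-marker, data-driven encoding can be sketched as follows. This is an assumption for illustration only: the abstract says genotypes are encoded "according to their distribution in the phenotypes" but does not give the formula, so the sketch maps each genotype of a marker to the mean phenotype of the training samples that carry it. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def fit_marker_encoding(genotypes, phenotypes):
    """Map each genotype of ONE marker to the mean phenotype of samples carrying it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for g, y in zip(genotypes, phenotypes):
        sums[g] += y
        counts[g] += 1
    return {g: sums[g] / counts[g] for g in sums}


def fit_encodings(genotype_matrix, phenotypes):
    """Fit a separate encoding for every marker (column); rows are samples."""
    n_markers = len(genotype_matrix[0])
    return [
        fit_marker_encoding([row[j] for row in genotype_matrix], phenotypes)
        for j in range(n_markers)
    ]


def encode(genotype_matrix, encodings, default):
    """Apply per-marker encodings; genotypes unseen in training get `default`."""
    return [
        [encodings[j].get(g, default) for j, g in enumerate(row)]
        for row in genotype_matrix
    ]


# Toy training data (assumed): 4 samples, 2 markers.
train_genotypes = [
    ["AA", "aa"],
    ["Aa", "aa"],
    ["aa", "AA"],
    ["AA", "Aa"],
]
train_phenotypes = [1.0, 2.0, 6.0, 3.0]

encodings = fit_encodings(train_genotypes, train_phenotypes)
overall_mean = sum(train_phenotypes) / len(train_phenotypes)
encoded = encode(train_genotypes, encodings, overall_mean)
print(encodings)
print(encoded)
```

In this sketch the encodings are fitted on training samples only and then applied to held-out samples, so that test phenotypes do not leak into the features. The encoded matrix can then be passed to whatever regression model is in use.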

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e8b9/4571493/22ead9e79aaa/1471-2105-16-S1-S10-1.jpg
