BMC Bioinformatics. 2015;16 Suppl 1(Suppl 1):S10. doi: 10.1186/1471-2105-16-S1-S10. Epub 2015 Feb 18.
Given a set of biallelic molecular markers, such as SNPs, with genotype values on a collection of plant, animal or human samples, the goal of quantitative genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Quantitative genetic trait prediction is usually formulated as a linear regression model, which requires quantitative encodings for the genotypes: the three distinct genotype values, corresponding to one heterozygous and two homozygous genotypes, are usually coded as integers and manipulated algebraically in the model. Further, epistasis between multiple markers is modeled as multiplication between the markers, and it is unclear whether the regression model remains effective under such an encoding. In this work we investigate the effects of encodings on the quantitative genetic trait prediction problem.
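To make the concern about multiplicative epistasis concrete, the following is a minimal sketch, not the authors' code. The two integer encodings (0/1/2 and -1/0/1) and all names such as `ENCODINGS` and `interaction` are assumptions chosen for illustration. It shows that the product term between two markers depends on the integer codes chosen for the genotypes.

```python
import numpy as np

# Two illustrative integer encodings for the three genotypes of a biallelic
# marker. Both are assumptions for this sketch, not taken from the paper.
ENCODINGS = {
    "012": {"AA": 0, "Aa": 1, "aa": 2},
    "-101": {"AA": -1, "Aa": 0, "aa": 1},
}


def encode(genotypes, scheme):
    """Turn a list of genotype labels into a numeric vector."""
    table = ENCODINGS[scheme]
    return np.array([table[g] for g in genotypes], dtype=float)


def interaction(marker_a, marker_b, scheme):
    """Epistasis term modeled as the elementwise product of two encoded markers."""
    return encode(marker_a, scheme) * encode(marker_b, scheme)


# Four samples covering the four combinations of homozygous genotypes.
marker_a = ["AA", "AA", "aa", "aa"]
marker_b = ["AA", "aa", "AA", "aa"]

prod_012 = interaction(marker_a, marker_b, "012")
prod_101 = interaction(marker_a, marker_b, "-101")

print(prod_012)  # 0/1/2 coding: the first three samples share one value
print(prod_101)  # -1/0/1 coding: samples 1 and 4 share a value, as do 2 and 3
```

Under the 0/1/2 coding, three of the four genotype combinations collapse to the same interaction value, whereas the -1/0/1 coding groups the samples differently. A regression on the product term therefore sees a different covariate depending on the encoding.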
We first show that, in many test cases, different encodings lead to different prediction accuracies. We then propose a data-driven encoding strategy in which genotypes are encoded according to their distribution in the phenotypes and each marker is allowed its own encoding. Our experiments show that this encoding strategy improves the performance of the genetic trait prediction method and that it is more helpful for oligogenic traits, whose values depend on a relatively small set of markers. To the best of our knowledge, this is the first paper to discuss the effects of encodings on the genetic trait prediction problem.
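The abstract does not spell out the exact encoding rule. One plausible reading, sketched below purely as an assumption, replaces each genotype at each marker with the mean phenotype of the training samples carrying that genotype, so that every marker receives its own three-value code. The function names `fit_encoding` and `apply_encoding`, the 0/1/2 input coding, and the toy data are hypothetical.

```python
import numpy as np


def fit_encoding(genotypes, phenotypes):
    """Learn a per-marker code: mean training phenotype for each genotype (0, 1, 2).

    Assumption for this sketch: genotypes is an (n_samples, n_markers) integer
    array with entries in {0, 1, 2}. Genotypes never seen at a marker fall back
    to the overall phenotype mean.
    """
    genotypes = np.asarray(genotypes)
    phenotypes = np.asarray(phenotypes, dtype=float)
    n_markers = genotypes.shape[1]
    table = np.full((n_markers, 3), phenotypes.mean())
    for j in range(n_markers):
        for g in range(3):
            mask = genotypes[:, j] == g
            if mask.any():
                table[j, g] = phenotypes[mask].mean()
    return table


def apply_encoding(genotypes, table):
    """Replace each integer genotype with its marker-specific learned value."""
    genotypes = np.asarray(genotypes)
    cols = np.arange(genotypes.shape[1])
    # Fancy indexing broadcasts cols across rows: result[i, j] = table[j, genotypes[i, j]]
    return table[cols, genotypes]


# Toy training data: 4 samples, 2 markers.
G_train = np.array([[0, 2], [1, 2], [2, 0], [2, 1]])
y_train = np.array([1.0, 2.0, 4.0, 6.0])

table = fit_encoding(G_train, y_train)
X_train = apply_encoding(G_train, table)
print(table)
print(X_train)
```

In a sketch like this, the encoding table should be fit on training samples only and then applied unchanged to held-out samples. Fitting it on the full data would leak phenotype information into the features.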