Milton Jacqueline N, Steinberg Martin H, Sebastiani Paola
Department of Biostatistics, School of Public Health, Boston University, Boston, MA, USA.
Department of Medicine, School of Medicine, Boston University, Boston, MA, USA.
Front Genet. 2015 Jan 13;5:474. doi: 10.3389/fgene.2014.00474. eCollection 2014.
Genome-wide association studies have shown that many genetic markers are associated with common quantitative traits. These markers typically have small to modest effect sizes, and each one individually explains only a small fraction of the variability of the phenotype. To build a genetic prediction model without fitting a multiple linear regression with possibly hundreds of genetic markers as predictors, researchers often summarize the joint effect of risk alleles in a genetic score that is used as a covariate in the prediction model. However, prediction accuracy can be highly variable, and selecting the optimal number of markers to include in the genetic score is challenging. In this manuscript we present a strategy to build an ensemble of genetic prediction models from data, and we show that the ensemble-based method makes the choice of the number of genetic markers more tractable. Using simulated data with varying heritability and number of genetic markers, we compare a single genetic prediction model with our proposed ensemble method in terms of predictive accuracy and the inclusion of true positive and false positive markers. The results show that the ensemble of genetic models tends to include more genetic variants than a single genetic model and is more likely to include all of the true genetic markers. This increased sensitivity comes at the price of lower specificity, which appears to affect the predictive accuracy of the ensemble only minimally.
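The idea of a genetic score and of an ensemble over scores of different sizes can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the manuscript: the simulated genotypes and effect sizes, the ranking of markers by absolute marginal correlation, the candidate score sizes (5, 10, 20, 40), and the plain averaging of predictions. The abstract does not state how the authors combine their models, so the simple average here is only one plausible choice. All function names are hypothetical.

```python
import numpy as np

def marginal_effects(G, y):
    """Per-marker simple-regression slope and absolute correlation with y."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    var = (Gc ** 2).sum(axis=0)
    var = np.where(var > 0, var, np.inf)  # monomorphic markers get effect 0
    cov = Gc.T @ yc
    beta = cov / var
    r = cov / np.sqrt(var * (yc ** 2).sum())
    return beta, np.abs(r)

def fit_score_model(G, y, order, beta, k):
    """Build a weighted score from the top-k markers, then regress y on it."""
    top = order[:k]
    score = G[:, top] @ beta[top]
    slope, intercept = np.polyfit(score, y, 1)
    return top, slope, intercept

def predict_score_model(G, beta, model):
    """Predict the trait from one fitted genetic-score model."""
    top, slope, intercept = model
    return intercept + slope * (G[:, top] @ beta[top])

def ensemble_predict(G, beta, models):
    """Assumed combination rule: unweighted mean of the models' predictions."""
    return np.mean([predict_score_model(G, beta, m) for m in models], axis=0)

# Simulated data (illustrative values only): 10 causal markers out of 200.
rng = np.random.default_rng(0)
n, m, n_causal = 600, 200, 10
G = rng.binomial(2, 0.3, size=(n, m)).astype(float)  # allele counts 0/1/2
true_beta = np.zeros(m)
true_beta[:n_causal] = 0.5
y = G @ true_beta + rng.normal(0, 1.5, size=n)

train, test = slice(0, 400), slice(400, None)
beta, strength = marginal_effects(G[train], y[train])
order = np.argsort(-strength)  # strongest marginal association first

# One model per candidate score size, instead of committing to a single size.
sizes = (5, 10, 20, 40)
models = [fit_score_model(G[train], y[train], order, beta, k) for k in sizes]
pred = ensemble_predict(G[test], beta, models)

# Markers used by at least one model in the ensemble.
included = np.unique(np.concatenate([mdl[0] for mdl in models]))
```

Because the score sizes are nested, the set of markers used by the ensemble equals the set used by its largest member, which is one way to see why an ensemble can capture more true markers while also admitting more false positives.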