LASSO with cross-validation for genomic selection.

Author information

Usai M Graziano, Goddard Mike E, Hayes Ben J

Affiliation

Settore Genetica e Biotecnologie, AGRIS-Sardegna, Olmedo 07040, Italy.

Publication information

Genet Res (Camb). 2009 Dec;91(6):427-36. doi: 10.1017/S0016672309990334.

PMID:20122298

Abstract

We used a least absolute shrinkage and selection operator (LASSO) approach to estimate marker effects for genomic selection. The least angle regression (LARS) algorithm and cross-validation were used to define the best subset of markers to include in the model. The LASSO-LARS approach was tested on two data sets: a simulated data set with 5865 individuals and 6000 single nucleotide polymorphisms (SNPs), and a mouse data set with 1885 individuals genotyped for 10,656 SNPs and phenotyped for a number of quantitative traits. In the simulated data, three approaches were used to split the reference population into training and validation subsets for cross-validation: random splitting across the whole population, and random sampling of the validation set from the last generation only, either within or across families. The highest accuracy was obtained by random splitting across the whole population. The accuracy of genomic estimated breeding values (GEBVs) in the candidate population obtained by LASSO-LARS was 0.89 with 156 explanatory SNPs. This value was higher than those obtained by best linear unbiased prediction (BLUP) and a Bayesian method (BayesA), which were 0.75 and 0.84, respectively. In the mouse data, 1600 individuals were randomly allocated to the reference population. The GEBVs for the remaining 285 individuals estimated by LASSO-LARS were more accurate than those obtained by BLUP and BayesA for weight at six weeks, and slightly less accurate for growth rate and body length. It was concluded that the LASSO-LARS approach is a good alternative method to estimate marker effects for genomic selection, particularly when the cost of genotyping can be reduced by using a limited subset of markers.
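
The workflow summarised above (fit a LASSO on a reference population, choose the penalty by cross-validation with random splitting, then predict GEBVs for candidates) can be illustrated with a small sketch. It is not the authors' implementation: it uses cyclic coordinate descent rather than the LARS algorithm, runs on a toy simulated data set, and every function name, the penalty grid, and the effect sizes are hypothetical. Accuracy is taken here as the Pearson correlation between predicted and simulated true genetic values, which is an assumption, since the abstract does not define it.

```python
import random

def soft(z, g):
    # Soft-thresholding operator used by the LASSO coordinate update.
    return z - g if z > g else z + g if z < -g else 0.0

def fit_lasso(X, y, lam, n_iter=50, tol=1e-6):
    """Minimise (1/2n)*||y - b0 - X b||^2 + lam*||b||_1 by coordinate descent.
    NOTE: coordinate descent is a stand-in for LARS (assumption of this sketch)."""
    n, p = len(X), len(X[0])
    xm = [sum(r[j] for r in X) / n for j in range(p)]
    ym = sum(y) / n
    Xc = [[r[j] - xm[j] for j in range(p)] for r in X]   # centred genotypes
    resid = [v - ym for v in y]                          # residual at beta = 0
    sq = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        biggest = 0.0
        for j in range(p):
            if sq[j] == 0.0:
                continue  # monomorphic marker carries no information
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + sq[j] * beta[j]
            new = soft(rho, lam) / sq[j]
            d = new - beta[j]
            if d != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * d
                beta[j] = new
                biggest = max(biggest, abs(d))
        if biggest < tol:
            break
    intercept = ym - sum(b * m for b, m in zip(beta, xm))
    return beta, intercept

def predict(X, beta, b0):
    return [b0 + sum(b * x for b, x in zip(beta, row)) for row in X]

def choose_lambda(X, y, lambdas, k=5, seed=0):
    # k-fold cross-validation with random splitting across all individuals.
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    best_lam, best_sse = None, None
    for lam in lambdas:
        sse = 0.0
        for fold in folds:
            held = set(fold)
            train = [i for i in idx if i not in held]
            beta, b0 = fit_lasso([X[i] for i in train], [y[i] for i in train], lam)
            pred = predict([X[i] for i in fold], beta, b0)
            sse += sum((y[i] - ph) ** 2 for i, ph in zip(fold, pred))
        if best_sse is None or sse < best_sse:
            best_lam, best_sse = lam, sse
    return best_lam

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

# Toy simulation (hypothetical values): 20 SNPs coded 0/1/2, three with an effect.
rng = random.Random(1)
effects = {0: 1.0, 5: -0.8, 12: 0.6}
n_snps = 20

def simulate(n):
    geno = [[rng.choice([0, 1, 2]) for _ in range(n_snps)] for _ in range(n)]
    tbv = [sum(e * row[j] for j, e in effects.items()) for row in geno]
    return geno, tbv

X_ref, tbv_ref = simulate(100)                        # reference population
y_ref = [g + rng.gauss(0.0, 0.5) for g in tbv_ref]    # phenotype = genetics + noise
X_cand, tbv_cand = simulate(50)                       # candidates: genotypes only

lambdas = [0.5, 0.2, 0.1, 0.05]
best_lam = choose_lambda(X_ref, y_ref, lambdas)
beta, b0 = fit_lasso(X_ref, y_ref, best_lam)
selected = [j for j, b in enumerate(beta) if b != 0.0]
accuracy = pearson(predict(X_cand, beta, b0), tbv_cand)
print("penalty:", best_lam, "selected SNPs:", selected, "accuracy:", round(accuracy, 3))
```

The number of markers kept in the model is a by-product of the chosen penalty: a larger penalty sets more marker effects exactly to zero, which is what makes a reduced genotyping panel possible.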
