The effect of different approaches to determining the regularization parameter of Bayesian LASSO on the accuracy of genomic prediction.

Author information

Sahebalam Hamid, Gholizadeh Mohsen, Hafezian Seyed Hassan

Affiliations

Department of Animal Science, Faculty of Animal and Aquatic Science, Sari Agricultural Sciences and Natural Resources University, Sari, Iran.

Publication information

Mamm Genome. 2025 Mar;36(1):331-345. doi: 10.1007/s00335-024-10088-7. Epub 2024 Dec 11.

Abstract

Using dense genomic markers opens up new opportunities and challenges for breeding programs. The need to penalize marker-specific regression coefficients becomes particularly important when dense markers are available. Therefore, fitting the marker effects to observations using a regularization technique, such as Bayesian LASSO (BL) regression, is of great interest. When the Laplace prior distribution is applied to the regression coefficients, BL can be interpreted as L1-norm regularization within a Bayesian framework. A critical issue is the appropriate selection of hyperparameter values in the prior distributions of regularization techniques, as these values essentially control the sparsity of the estimated model. The purpose of this study was to evaluate different approaches for selecting the regularization parameter in BL: fully Bayesian approaches, such as a gamma prior (BL_Gamma), a beta prior (BL_Beta), and a fixed prior (BL_Fixed), as well as data-driven approaches such as cross-validation based on mean square error (BL_CV_MSE) and on prediction accuracy (BL_CV_PA). Additionally, information-criteria-based methods, including Akaike's information criterion (BL_AIC), the Bayesian information criterion (BL_BIC), and the deviance information criterion (BL_DIC), were explored. For this purpose, a genome containing eight chromosomes (each 1 Morgan in length) with 100 randomly distributed quantitative trait loci was simulated. The studied scenarios were as follows: scenario 1 involved 4000 markers and a heritability of 0.2; scenario 2 involved 4000 markers and a heritability of 0.6; scenario 3 involved 16,000 markers and a heritability of 0.2; and scenario 4 involved 16,000 markers and a heritability of 0.6. The results showed that, among the fully Bayesian and cross-validation approaches, BL_Gamma, BL_Beta, and BL_CV_MSE provided the highest prediction accuracy (PA) in scenarios 1 and 3. With increased marker density and heritability (scenario 4), the cross-validation approaches performed slightly better. The information-criteria-based methods demonstrated the lowest PA. Increasing heritability led to a decrease in the model penalty on the regression coefficients, whereas increasing marker density led to an increase. The PA obtained in the target population ranged from 0.210 to 0.413 in scenario 1, 0.402 to 0.600 in scenario 2, 0.256 to 0.442 in scenario 3, and 0.478 to 0.653 in scenario 4. In general, fully Bayesian approaches based on random priors for the regularization parameter are recommended for BL, as they provide acceptable PA with lower computational loads.
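
To make the two cross-validation criteria concrete, the following is a minimal sketch of choosing a LASSO penalty by mean square error and by prediction accuracy (the correlation between predicted and observed phenotypes). It is not the authors' implementation. It fits an ordinary LASSO by coordinate descent, which corresponds to the posterior mode under a Laplace prior, rather than running a Bayesian LASSO sampler. The function names, the penalty grid, and the toy data sizes (200 individuals, 50 markers, 5 QTL) are all assumptions made for illustration and are far smaller than the simulated scenarios described above.

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink z toward zero by t; values within [-t, t] become zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by coordinate descent.

    Stand-in for the Bayesian LASSO: this returns the posterior mode under
    a Laplace prior, not posterior means from an MCMC sampler.
    """
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()  # running residual y - Xb
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]          # remove marker j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]          # add the updated marker j back
    return b

def cv_select(X, y, lambdas, k=5, seed=0):
    """Return the penalties chosen by CV mean square error and by CV accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    mse = np.zeros(len(lambdas))
    acc = np.zeros(len(lambdas))
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        Xtr, ytr = X[train] - mu_x, y[train] - mu_y
        for i, lam in enumerate(lambdas):
            b = lasso_cd(Xtr, ytr, lam)
            pred = (X[test] - mu_x) @ b + mu_y
            mse[i] += np.mean((y[test] - pred) ** 2) / k
            # Correlation is undefined for a constant prediction; count it as 0.
            if pred.std() > 0:
                acc[i] += np.corrcoef(pred, y[test])[0, 1] / k
    return lambdas[np.argmin(mse)], lambdas[np.argmax(acc)], mse, acc

# Toy data (hypothetical sizes): genotypes coded 0/1/2, a few QTL, h2 = 0.6.
rng = np.random.default_rng(1)
n, p, n_qtl, h2 = 200, 50, 5, 0.6
X = rng.binomial(2, 0.5, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[rng.choice(p, n_qtl, replace=False)] = rng.normal(size=n_qtl)
g = X @ beta                                      # true genetic values
e = rng.normal(scale=np.sqrt(g.var() * (1 - h2) / h2), size=n)
y = g + e

lambdas = np.logspace(-3, 0, 10)                  # assumed penalty grid
lam_mse, lam_pa, mse, acc = cv_select(X, y, lambdas)
print("penalty chosen by CV MSE:", lam_mse)
print("penalty chosen by CV prediction accuracy:", lam_pa)
```

The two criteria need not select the same penalty: mean square error is sensitive to the scale of the predictions, while the correlation is not, which is one reason to evaluate them separately.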
