Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, 14476 Potsdam-Golm, Germany.
Bioinformatics Group, Institute of Biochemistry and Biology, University of Potsdam, 14476 Potsdam-Golm, Germany.
Bioinformatics. 2021 Sep 29;37(18):2896-2904. doi: 10.1093/bioinformatics/btab212.
Genomic selection (GS) is currently deemed the most effective approach to speed up breeding of agricultural varieties. It has been recognized that considering multiple traits in GS can improve prediction accuracy for traits of low heritability. However, because GS forgoes statistical testing in order to improve predictions, it does not facilitate a mechanistic understanding of the contribution of particular single nucleotide polymorphisms (SNPs).
Here, we propose an L2,1-norm regularized multivariate regression model and devise a fast and efficient iterative optimization algorithm, called L2,1-joint, applicable in multi-trait GS. The use of the L2,1-norm facilitates variable selection in a penalized multivariate regression that considers the relation between individuals when the number of SNPs is much larger than the number of individuals. The capacity for variable selection allows us to define master regulators that can be used in a multi-trait GS setting to dissect the genetic architecture of the analyzed traits. Our comparative analyses demonstrate that the proposed model is a favorable candidate compared to existing state-of-the-art approaches. Prediction and variable selection with datasets from Brassica napus, wheat and Arabidopsis thaliana diversity panels are conducted to further showcase the performance of the proposed model.
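To illustrate why an L2,1-norm penalty leads to variable selection across several traits at once, the following is a minimal sketch in Python. It is not the authors' L2,1-joint algorithm (their implementation is in R). It assumes the standard definition of the L2,1-norm of a coefficient matrix B as the sum of the Euclidean norms of its rows, with rows taken as SNPs and columns as traits, and it shows only the row-wise soft-thresholding step (the proximal operator of that norm) used by generic proximal solvers. The function names, the matrix B and the threshold tau are hypothetical and chosen for illustration.

```python
import math


def l21_norm(B):
    """Sum of the Euclidean norms of the rows of B (rows = SNPs, columns = traits)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in B)


def row_soft_threshold(B, tau):
    """Proximal operator of tau * ||B||_{2,1}.

    Each row is shrunk towards zero by a common factor; a row whose
    Euclidean norm is at most tau becomes exactly zero for all traits.
    """
    out = []
    for row in B:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out


# Hypothetical coefficient matrix: 3 SNPs (rows) x 2 traits (columns).
B = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]

shrunk = row_soft_threshold(B, tau=1.0)

# A SNP counts as "selected" if it keeps a nonzero effect on at least one trait.
selected = [j for j, row in enumerate(shrunk) if any(v != 0.0 for v in row)]
```

Because the penalty acts on whole rows, a SNP is either retained for all traits or removed for all traits. In this toy example only the first SNP, with a large joint effect, survives the thresholding.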
The model is implemented in the R programming language, and the code is freely available from https://github.com/alainmbebi/L21-norm-GS.
Supplementary data are available at Bioinformatics online.