Merrick Lance F, Lozada Dennis N, Chen Xianming, Carter Arron H
Department of Crop and Soil Sciences, Washington State University, Pullman, WA, United States.
Department of Plant and Environmental Sciences, New Mexico State University, Las Cruces, NM, United States.
Front Genet. 2022 Feb 23;13:835781. doi: 10.3389/fgene.2022.835781. eCollection 2022.
Most genomic prediction models are linear regression models that assume continuous and normally distributed phenotypes, but responses to diseases such as stripe rust (caused by Puccinia striiformis f. sp. tritici) are commonly recorded on ordinal scales and as percentages. Disease severity (SEV) and infection type (IT) data from germplasm screening nurseries generally do not meet these assumptions. In this regard, researchers may ignore the lack of normality, transform the phenotypes, use generalized linear models, or use supervised learning algorithms and classification models that place no restriction on the distribution of the response variable and are less sensitive when modeling ordinal scores. The goal of this research was to compare classification and regression genomic selection models for skewed phenotypes using stripe rust SEV and IT in winter wheat. We extensively compared regression and classification prediction models using two training populations: breeding lines phenotyped in 4 years (2016-2018 and 2020) and a diversity panel phenotyped in 4 years (2013-2016). The prediction models used 19,861 genotyping-by-sequencing single-nucleotide polymorphism markers. Overall, ridge regression best linear unbiased prediction and support vector machine regression models fit to square-root-transformed phenotypes displayed the highest combination of accuracy and relative efficiency across the regression and classification models. Furthermore, a classification system based on support vector machine and ordinal Bayesian models with a 2-Class scale for SEV reached the highest class accuracy of 0.99. This study showed that breeders can use linear and non-parametric regression models within their own breeding lines over combined years to accurately predict skewed phenotypes.
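The square-root transformation followed by ridge regression can be illustrated with a minimal sketch on simulated data. This is not the authors' pipeline: the marker matrix, effect sizes, severity scale, and the fixed penalty `lam` are all assumptions made for illustration, and ridge regression best linear unbiased prediction would normally estimate the shrinkage from variance components rather than fix it by hand. Accuracy is taken here as the Pearson correlation between predicted and observed transformed phenotypes in held-out folds.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge regression with an unpenalized intercept.

    Uses the n x n dual form, which is cheaper when markers outnumber lines.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(len(y)), y - y_mean)
    beta = Xc.T @ alpha  # marker effects
    return beta, y_mean - x_mean @ beta


def ridge_predict(X, beta, intercept):
    return X @ beta + intercept


def simulate(n_lines=200, n_markers=500, seed=0):
    """Simulated 0/1/2 marker scores and a right-skewed severity score (0-100).

    All values are hypothetical; nothing here comes from the study's data.
    """
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 3, size=(n_lines, n_markers)).astype(float)
    effects = rng.normal(0.0, 0.15, n_markers)
    latent = X @ effects + rng.normal(0.0, 1.0, n_lines)
    latent = (latent - latent.mean()) / latent.std()
    sev = np.clip(10.0 * np.exp(latent), 0.0, 100.0)  # skewed toward low scores
    return X, sev


def skewness(v):
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))


def cv_accuracy(X, y, lam=100.0, k=5, seed=0):
    """Mean Pearson correlation between predictions and observations over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        beta, b0 = ridge_fit(X[train], y[train], lam)
        pred = ridge_predict(X[test], beta, b0)
        accs.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(accs))


if __name__ == "__main__":
    X, sev = simulate()
    sev_sqrt = np.sqrt(sev)  # square-root transformation of the skewed phenotype
    print("skewness raw / sqrt:", skewness(sev), skewness(sev_sqrt))
    print("CV accuracy on sqrt scale:", cv_accuracy(X, sev_sqrt))
```

Predictions are made and scored on the transformed scale; squaring them would return values on the original severity scale if that is needed for reporting.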