Facultad de Telemática, Universidad de Colima, 28040, Colima, México.
Statistics Study Program, Universitas Negeri Yogyakarta, 55281, Yogyakarta, Indonesia.
BMC Genomics. 2023 Apr 26;24(1):220. doi: 10.1186/s12864-023-09294-5.
Genomic selection (GS) is revolutionizing plant and animal breeding. However, its practical implementation remains challenging because it is affected by many factors that, when not under control, make the methodology ineffective. In addition, because GS is generally formulated as a regression problem, it has low sensitivity for selecting the best candidate individuals, since a top percentage is selected according to a ranking of predicted breeding values.
For this reason, in this paper we propose two methods to improve the prediction accuracy of this methodology. The first reformulates GS, which is currently formulated as a regression problem, as a binary classification problem. The second is only a postprocessing step: it adjusts the threshold used to classify the lines predicted on their original (continuous) scale so as to guarantee similar sensitivity and specificity. The postprocessing step is applied to the predictions obtained from the conventional regression model. Both methods assume that a threshold has been defined in advance to divide the training data into top lines and non-top lines. This threshold can be set as a quantile (for example 80%, 90%, etc.) or as the average (or maximum) performance of the checks. In the reformulation method, lines in the training set that are equal to or larger than the specified threshold are labeled as one, and all others as zero. A binary classification model is then trained with the conventional inputs, but using the binary response variable in place of the continuous response variable. The binary classifier should be trained so that sensitivity and specificity are more similar, which guarantees a reasonable probability of correctly classifying the top lines.
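The labeling and postprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`quantile`, `label_top`, `sens_spec`, `tune_cutoff`), the toy data, and the choice to search candidate cutoffs on lines with known labels and minimize the gap between sensitivity and specificity are all hypothetical; the abstract does not specify the search procedure.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def label_top(values, cutoff):
    """Label a line 1 if its value is equal to or larger than the cutoff, else 0."""
    return [1 if v >= cutoff else 0 for v in values]


def sens_spec(y_true, y_pred):
    """Return (sensitivity, specificity) for two lists of 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def tune_cutoff(y_true, y_hat):
    """Assumed search: try each predicted value as a cutoff and keep the one
    whose sensitivity and specificity are closest to each other."""
    best_cut, best_gap = None, float("inf")
    for cut in sorted(set(y_hat)):
        sens, spec = sens_spec(y_true, label_top(y_hat, cut))
        gap = abs(sens - spec)
        if gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


# Toy data (hypothetical): observed phenotypes and regression predictions
# that are shrunk toward the mean, as is typical of genomic regression.
observed = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.0, 7.4, 8.1, 9.0]
predicted = [5.8, 6.0, 6.1, 6.3, 6.2, 6.6, 6.5, 6.9, 7.0, 7.2]

threshold = quantile(observed, 0.8)       # 80% quantile defines top lines
y_true = label_top(observed, threshold)   # binary response for reformulation

naive = label_top(predicted, threshold)   # same threshold on predictions
cutoff = tune_cutoff(y_true, predicted)   # postprocessing: adjusted cutoff
tuned = label_top(predicted, cutoff)

print("unadjusted:", sens_spec(y_true, naive))
print("adjusted:  ", sens_spec(y_true, tuned))
```

In this toy case the shrunken predictions never reach the threshold set on the observed scale, so the unadjusted rule selects no top lines, while the adjusted cutoff recovers them. The `y_true` vector is also what a binary classifier would be trained on in the reformulation method.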
We evaluated the proposed methods on seven data sets and found that both outperformed the conventional regression model by a large margin (with the postprocessing method, by 402.9% in terms of sensitivity, by 110.04% in terms of F1 score and by 70.96% in terms of Kappa coefficient). Of the two proposed methods, the postprocessing method was better than the reformulation as a binary classification model. The simple postprocessing method thus improves the accuracy of conventional genomic regression models without the need to reformulate them as binary classification models, gives similar or better performance, and significantly improves the selection of the top candidate lines. In general, both proposed methods are simple and can easily be adopted in practical breeding programs, with the guarantee that they will significantly improve the selection of the top candidate lines.
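For reference, the three metrics reported above can be computed from a 2x2 confusion matrix using their standard definitions. The sketch below uses hypothetical counts and a hypothetical helper `relative_gain`; it assumes the reported percentages are relative gains over the conventional regression model, which the abstract does not state explicitly.

```python
def metrics(tp, fn, fp, tn):
    """Sensitivity, F1 score and Cohen's Kappa from confusion-matrix counts."""
    n = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    observed_agreement = (tp + tn) / n
    # Agreement expected by chance from the row and column totals.
    expected_agreement = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (observed_agreement - expected_agreement) / (1 - expected_agreement)
    return sensitivity, f1, kappa


def relative_gain(baseline, proposed):
    """Percentage gain of `proposed` over `baseline` (assumed definition)."""
    return 100 * (proposed - baseline) / baseline


# Hypothetical counts for 100 lines, 20 of which are truly top lines.
sens, f1, kappa = metrics(tp=15, fn=5, fp=10, tn=70)
print(f"sensitivity={sens:.3f}  F1={f1:.3f}  Kappa={kappa:.3f}")
print(f"gain from 0.1 to 0.5: {relative_gain(0.1, 0.5):.1f}%")
```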