Miriam Piles, Rob Bergsma, Daniel Gianola, Hélène Gilbert, Llibertat Tusell
Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Barcelona, Spain.
Topigs Norsvin Research Center, Beuningen, Netherlands.
Front Genet. 2021 Feb 22;12:611506. doi: 10.3389/fgene.2021.611506. eCollection 2021.
Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and to reduce computation time and resources. In genomics, FS makes it possible to identify relevant markers and to design low-density SNP chips for evaluating selection candidates. In this research, several univariate and multivariate FS algorithms, combined with various parametric and non-parametric learners, were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the combination of feature selector, SNP subset size, and learner yielding the most accurate and stable (i.e., least sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection served as the benchmark. Three types of FS methods were implemented: (i) filter methods, either univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as a baseline; (ii) embedded methods, namely elastic net and least absolute shrinkage and selection operator (LASSO) regression; and (iii) combinations of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) learners were applied after pre-selection with the filter methods. Data comprised 5,708 individual records of residual feed intake to be predicted from the animal's own genotype. Accuracy (stability of results) was measured as the median (interquartile range) of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000-1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with random selection.
With 50-250 SNPs, the FS method had a strong impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good, and similar to that obtained with larger SNP subsets, when spearcor or mrmr was implemented with or without an embedded method. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
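The evaluation scheme described above (pre-select SNPs with a filter inside each fold, fit a learner, score the hold-out fold by Spearman correlation, and report the median and interquartile range across the 10 folds) can be sketched as follows. This is a minimal illustration on simulated data, not the authors' code: the toy dimensions, the RBF-kernel SVR, and the `spearcor_filter` helper are all assumptions for demonstration.

```python
# Minimal sketch of the accuracy/stability metric: median (IQR) of the
# Spearman correlation between observed and predicted values in 10-fold CV,
# with a univariate Spearman-correlation filter (spearcor-style) applied
# inside each training fold. Simulated data; toy sizes, not the study's.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n, p, k = 300, 2000, 100                             # animals, SNPs, SNPs kept (toy)
X = rng.integers(0, 3, size=(n, p)).astype(float)    # genotypes coded 0/1/2
beta = np.zeros(p)
beta[:20] = rng.normal(0.0, 1.0, 20)                 # 20 truly associated SNPs
y = X @ beta + rng.normal(0.0, 3.0, n)               # simulated residual feed intake

def spearcor_filter(X_tr, y_tr, k):
    """Keep the k SNPs with the largest |Spearman rho| to the phenotype."""
    rho = np.array([spearmanr(X_tr[:, j], y_tr)[0] for j in range(X_tr.shape[1])])
    return np.argsort(-np.abs(rho))[:k]

scores = []
for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    keep = spearcor_filter(X[tr], y[tr], k)          # FS inside the fold: no leakage
    model = SVR(kernel="rbf").fit(X[tr][:, keep], y[tr])
    scores.append(spearmanr(y[te], model.predict(X[te][:, keep]))[0])

q1, med, q3 = np.percentile(scores, [25, 50, 75])
print(f"accuracy (stability): {med:.2f} ({q3 - q1:.2f})")
```

Performing the filter step inside each training fold, rather than once on the full data, is what makes the cross-validated correlation an honest estimate: selecting SNPs on all 5,708 records first would leak test-set information into the model.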
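The filter-plus-embedded combination suggested for low-density chip design can likewise be sketched as a two-stage shortlist: a univariate Spearman filter trims the marker set, then LASSO zeroes out redundant survivors. The stage sizes, `LassoCV` settings, and simulated data below are illustrative assumptions, not the settings used in the study.

```python
# Hedged sketch of a filter + embedded pipeline for shortlisting SNPs for a
# hypothetical low-density panel: Spearman-correlation filter, then LASSO.
# All sizes and data are simulated for illustration only.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 400, 1000
X = rng.integers(0, 3, size=(n, p)).astype(float)    # genotypes coded 0/1/2
beta = np.zeros(p)
beta[:15] = rng.normal(0.0, 1.0, 15)                 # 15 truly associated SNPs
y = X @ beta + rng.normal(0.0, 3.0, n)

# Stage 1 (filter): keep the 250 SNPs most rank-correlated with the phenotype.
rho = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(p)])
stage1 = np.argsort(-rho)[:250]

# Stage 2 (embedded): LASSO drops SNPs whose coefficient shrinks to zero.
lasso = LassoCV(cv=5, random_state=1).fit(X[:, stage1], y)
panel = stage1[lasso.coef_ != 0]
print(f"low-density panel size: {panel.size} SNPs")
```

The embedded stage mainly removes redundancy among filter survivors (e.g., SNPs in strong linkage disequilibrium), which is why the abstract reports that these filters, with or without an embedded step, remained accurate and stable down to 50-250 SNPs.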