School of Mathematics and Statistics, University College Dublin, Belfield, Dublin 4, Ireland; Teagasc, Animal & Grassland Research and Innovation Centre, Moorepark, Fermoy, Co. Cork, P61 P302 Ireland.
School of Mathematics and Statistics, University College Dublin, Belfield, Dublin 4, Ireland.
J Dairy Sci. 2021 Jul;104(7):7438-7447. doi: 10.3168/jds.2020-19576. Epub 2021 Apr 15.
Numerous statistical machine learning methods suited to highly correlated features, such as those found in spectral data, could potentially improve prediction performance over the commonly used partial least squares approach. Milk samples from 622 individual cows, with known detailed protein composition and technological trait data and with accompanying mid-infrared spectra, were available to assess the predictive ability of different regression and classification algorithms. The regression-based approaches were partial least squares regression (PLSR), ridge regression (RR), least absolute shrinkage and selection operator (LASSO), elastic net, principal component regression, projection pursuit regression, spike and slab regression, random forests, boosting decision trees, neural networks (NN), and a post-hoc approach of model averaging (MA). Several classification methods [i.e., partial least squares discriminant analysis (PLSDA), random forests, boosting decision trees, and support vector machines (SVM)] were also used after stratifying the traits of interest into categories. In the regression analyses, MA was the best prediction method for 6 of the 14 traits investigated [curd firmness at 60 min, αS1-casein (CN), αS2-CN, κ-CN, α-lactalbumin, and β-lactoglobulin B]. NN was the best algorithm for 3 traits (rennet coagulation time, curd-firming time, and heat stability), as was RR (curd firmness at 30 min, β-CN, and β-lactoglobulin A); PLSR was best for pH, and LASSO was best for CN micelle size. When traits were divided into 2 classes, SVM had the greatest accuracy for the majority of the traits investigated. Although the well-established PLSR-based method performed competitively, applying statistical machine learning methods in the regression analyses reduced the root mean square error relative to PLSR by between 0.18% (κ-CN) and 3.67% (heat stability).
The use of modern statistical machine learning methods for trait prediction from mid-infrared spectroscopy may improve the prediction accuracy for some traits.
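As a rough illustration of how the comparison above can be expressed, the sketch below computes the root mean square error (RMSE) for two models, forms a post-hoc average of their predictions, and reports the percentage reduction in RMSE relative to PLSR. It assumes equal-weight averaging, which the abstract does not specify; the function names and all numbers are hypothetical and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def average_predictions(prediction_sets):
    """Equal-weight average of several models' predictions, sample by sample.

    Assumption: equal weights. The study's averaging scheme may differ.
    """
    return [sum(values) / len(values) for values in zip(*prediction_sets)]


def percent_rmse_reduction(baseline_rmse, candidate_rmse):
    """Percentage by which candidate_rmse is lower than baseline_rmse."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse


# Made-up trait values and predictions, for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
predictions = {
    "PLSR": [11.0, 11.0, 15.0, 15.0],
    "RR": [9.0, 13.0, 15.0, 17.0],
}
predictions["MA"] = average_predictions(
    [predictions["PLSR"], predictions["RR"]]
)

rmse_by_model = {name: rmse(observed, preds) for name, preds in predictions.items()}
reduction = percent_rmse_reduction(rmse_by_model["PLSR"], rmse_by_model["MA"])

for name, value in rmse_by_model.items():
    print(f"{name}: RMSE = {value:.3f}")
print(f"RMSE reduction of MA relative to PLSR: {reduction:.2f}%")
```

In practice the predictions would come from models validated on held-out samples, so that the averaged model is not favored by overfitting; the toy numbers here only show the arithmetic.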