Denis A. Shah, Erick D. De Wolf, Pierce A. Paul, Laurence V. Madden
Department of Plant Pathology, Kansas State University, Manhattan, Kansas, United States of America.
Department of Plant Pathology, The Ohio State University, Ohio Agricultural Research and Development Center, Wooster, Ohio, United States of America.
PLoS Comput Biol. 2021 Mar 15;17(3):e1008831. doi: 10.1371/journal.pcbi.1008831. eCollection 2021 Mar.
Ensembling combines the predictions made by individual component base models with the goal of achieving a predictive accuracy that is better than that of any one of the constituent member models. Diversity among the base models in terms of predictions is a crucial criterion in ensembling. However, there are practical instances when the available base models produce highly correlated predictions, because they may have been developed within the same research group or may have been built from the same underlying algorithm. We investigated, via a case study on Fusarium head blight (FHB) on wheat in the U.S., whether ensembles of simple yet highly correlated models for predicting the risk of FHB epidemics, all generated from logistic regression, provided any benefit to predictive performance, despite relatively low levels of base model diversity. Three ensembling methods were explored: soft voting, weighted averaging of smaller subsets of the base models, and penalized regression as a stacking algorithm. Soft voting and weighted model averages were generally better at classification than the base models, though not universally so. The performances of stacked regressions were superior to those of the other two ensembling methods we analyzed in this study. Ensembling simple yet correlated models is computationally feasible and is therefore worth pursuing for models of epidemic risk.
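The three ensembling methods named in the abstract can be illustrated with a short sketch. This is not the authors' FHB risk model or data: the base learners, feature subsets, and synthetic dataset below are assumptions chosen only to show the mechanics of soft voting and stacking over correlated logistic-regression base models.

```python
# Hedged sketch of soft voting and stacking over correlated
# logistic-regression base models. Synthetic data; illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8,
                           n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0)

# Base models: logistic regressions fit on overlapping feature subsets,
# so their predicted probabilities are highly correlated, mimicking the
# low-diversity setting studied in the paper.
subsets = [[0, 1, 2, 3], [0, 1, 2, 4], [0, 1, 3, 5]]
bases = [LogisticRegression().fit(X_tr[:, s], y_tr) for s in subsets]

# Matrix of base-model predicted probabilities on the test set.
P_te = np.column_stack([m.predict_proba(X_te[:, s])[:, 1]
                        for m, s in zip(bases, subsets)])

# Soft voting: average the predicted probabilities, classify at 0.5.
soft_vote = (P_te.mean(axis=1) >= 0.5).astype(int)

# Stacking: fit a penalized logistic regression (L2 by default in
# scikit-learn) on the base models' predictions. For brevity we stack
# on training-set predictions; a proper stack would use cross-validated
# (out-of-fold) predictions to avoid leakage.
P_tr = np.column_stack([m.predict_proba(X_tr[:, s])[:, 1]
                        for m, s in zip(bases, subsets)])
stacker = LogisticRegression(C=1.0).fit(P_tr, y_tr)
stacked = stacker.predict(P_te)

acc = lambda pred: float((pred == y_te).mean())
print("base accuracies:",
      [round(acc(m.predict(X_te[:, s])), 3)
       for m, s in zip(bases, subsets)])
print("soft vote:", round(acc(soft_vote), 3))
print("stacked:  ", round(acc(stacked), 3))
```

A weighted model average (the second method in the abstract) replaces the plain mean in the soft-voting line with a weighted mean over a chosen subset of the columns of `P_te`; the stacker's fitted coefficients can be read as one data-driven choice of such weights.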