Ramazi Pouria, Kunegel-Lion Mélodie, Greiner Russell, Lewis Mark A
Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada.
Department of Computing Science, University of Alberta, Edmonton, AB, Canada.
Ecol Evol. 2021 Sep 12;11(19):13014-13028. doi: 10.1002/ece3.7921. eCollection 2021 Oct.
Planning forest management relies on predicting insect outbreaks such as mountain pine beetle, particularly in the intermediate-term future, e.g., 5 years ahead. Machine-learning algorithms are potential solutions to this challenging problem due to their many successes across a variety of prediction tasks. However, there are many subtle challenges in applying them: identifying the best learning models and the best subset of available covariates (including time lags), and properly evaluating the models to avoid misleading performance measures. We systematically address these issues in predicting the chance of a mountain pine beetle outbreak in the Cypress Hills area and seek models with the best performance at predicting future 1-, 3-, 5- and 7-year infestations. We train nine machine-learning models, including two generalized boosted regression trees (GBM) that predict future 1- and 3-year infestations with 92% and 88% AUC, and two novel mixed models that predict future 5- and 7-year infestations with 86% and 84% AUC, respectively. We also consider forming the train and test datasets by randomly splitting the original dataset rather than using the appropriate year-based approach, and show that this may yield models that score high on the test dataset but low in practice, resulting in inaccurate performance evaluations. For example, a k-nearest neighbor model with an actual performance of 68% AUC scores a misleadingly high 78% on a test dataset obtained from a random split, but a more accurate 66% on a year-based split. We then investigate how the prediction accuracy varies with the provided history length of the covariates and find that neural network and naive Bayes models predict more accurately as history length increases, particularly for future 1- and 3-year predictions, and roughly the same holds for GBM. Our approach is applicable to other invasive species. The resulting predictors can be used in planning forest and pest management and planning sampling locations in field studies.
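The difference between a random split and a year-based split can be illustrated with a minimal sketch. The abstract does not describe how the splits were implemented, so everything below is an assumption made for illustration: the record layout, the function names `year_based_split` and `random_split`, the years 2006 to 2015, and the cutoff year 2013 are all hypothetical.

```python
import random


def year_based_split(records, first_test_year):
    """Train on records strictly before first_test_year, test on the rest."""
    train = [r for r in records if r["year"] < first_test_year]
    test = [r for r in records if r["year"] >= first_test_year]
    return train, test


def random_split(records, test_fraction, seed=0):
    """Shuffle all records and hold out a fraction, ignoring the year."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def years(records):
    """Set of years that appear in a list of records."""
    return {r["year"] for r in records}


# Hypothetical toy data: 4 grid cells, each observed once a year for 10 years.
records = [{"cell": c, "year": y} for c in range(4) for y in range(2006, 2016)]

train_y, test_y = year_based_split(records, first_test_year=2013)
train_r, test_r = random_split(records, test_fraction=0.3)

# In the year-based split every test year lies after every training year,
# which mimics predicting a future the model has never seen.
print("year-based train years:", sorted(years(train_y)))
print("year-based test years: ", sorted(years(test_y)))

# In the random split the same years usually appear on both sides, so the
# test set can contain observations closely related to the training data.
print("years shared by random train/test:", sorted(years(train_r) & years(test_r)))
```

With a year-based split, the test score reflects forecasting into unseen years. With a random split, the same years generally occur in both sets, which is one plausible reason a test score can overstate real-world performance, as in the k-nearest neighbor example reported above.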