Wen Zihao, Dowe David L
College of Mathematics and Informatics, South China Agricultural University, No. 483, Wushan Road, Tianhe District, Guangzhou 510642, China.
Department of Data Science and Artificial Intelligence, Faculty of Information Technology, Monash University, Clayton, VIC 3800, Australia.
Entropy (Basel). 2024 Dec 26;27(1):6. doi: 10.3390/e27010006.
Species distribution modeling is fundamental to biodiversity, evolution, conservation science, and the study of invasive species. Given environmental data and species distribution data, model selection techniques are frequently used to help identify relevant features. Existing studies select the best model under some criterion and treat the predictors in that model as the relevant features. However, most of them consider only a single, given model family, which leaves them vulnerable to model family misspecification. To address this issue, this paper introduces the Bayesian information-theoretic minimum message length (MML) principle to species distribution model selection. In particular, we provide a framework in which the message lengths of models from multiple model families can be calculated and compared, so that model selection is both accurate and robust to model family misspecification and data aggregation. To find the relevant features efficiently, we further develop a novel search algorithm that does not require calculating the message length for every possible subset of features. Experimental results demonstrate that the proposed method outperforms competing methods at selecting the best model on both artificial and real-world datasets. Specifically, of the 11 tests on artificial data, one was failed by all methods; on the other 10, the MML method selected the correct model every time, whereas every alternative method failed on some of them. The real-world data pertain to two plant species from Barro Colorado Island, Panama. For both species, the MML method selected the simplest model among the compared methods while also giving the best overall predictions.
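The idea of comparing message lengths across model families without scoring every feature subset can be illustrated with a minimal sketch. This is not the search algorithm of the paper, which the abstract does not specify; it is an ordinary greedy forward selection used as a stand-in. The function `greedy_search`, the scoring callable `toy_length`, the family names, and the feature names are all hypothetical, and the toy score is an invented formula, not an MML message length.

```python
from typing import Callable, FrozenSet, Sequence, Tuple

# Hypothetical scoring interface: (family name, feature subset) -> message length.
LengthFn = Callable[[str, FrozenSet[str]], float]


def greedy_search(
    features: Sequence[str], families: Sequence[str], message_length: LengthFn
) -> Tuple[float, str, FrozenSet[str], int]:
    """Greedy forward selection, run separately within each model family.

    Returns (shortest length, family, chosen features, number of evaluations).
    """
    evaluations = 0
    best: Tuple[float, str, FrozenSet[str]] = (float("inf"), "", frozenset())
    for family in families:
        chosen: FrozenSet[str] = frozenset()
        current = message_length(family, chosen)
        evaluations += 1
        while True:
            # Score every one-feature extension of the current subset.
            candidates = [chosen | {f} for f in sorted(set(features) - chosen)]
            if not candidates:
                break
            scored = [(message_length(family, c), c) for c in candidates]
            evaluations += len(scored)
            length, subset = min(scored, key=lambda pair: pair[0])
            if length >= current:
                break  # no extension shortens the message, so stop growing
            current, chosen = length, subset
        # Lengths are on one scale, so families can be compared directly.
        if current < best[0]:
            best = (current, family, chosen)
    return best[0], best[1], best[2], evaluations


# Toy stand-in for a message length (an assumption, not an MML formula):
# each relevant feature saves 30 units, and every feature costs 5 to state.
RELEVANT = frozenset({"rain", "elevation"})
BASE = {"logistic": 100.0, "poisson": 120.0}


def toy_length(family: str, subset: FrozenSet[str]) -> float:
    return BASE[family] - 30.0 * len(subset & RELEVANT) + 5.0 * len(subset)


result = greedy_search(
    ["rain", "elevation", "slope", "ph"], ["logistic", "poisson"], toy_length
)
print(result)
```

On this toy problem the sketch scores 20 candidate models, compared with 32 for exhaustive enumeration (2 families times 2^4 subsets). Greedy search is not guaranteed to find the shortest message in general, which is one reason a purpose-built search may be preferable.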