Resheff Yehezkel S, Bensch Hanna M, Zöttl Markus, Harel Roi, Matsumoto-Oda Akiko, Crofoot Margaret C, Gomez Sara, Börger Luca, Rotics Shay
Hebrew University Business School, The Hebrew University of Jerusalem, Jerusalem, Israel.
Department of Biology and Environmental Science, Centre for Ecology and Evolution in Microbial Model Systems (EEMIS), Linnaeus University, 391 82, Kalmar, Sweden.
Mov Ecol. 2024 Jun 10;12(1):44. doi: 10.1186/s40462-024-00485-7.
The application of supervised machine learning methods to identify behavioural modes from inertial measurements of bio-loggers has become a standard tool in behavioural ecology. Several design choices can affect the accuracy of identifying the behavioural modes. One such choice is whether to include or exclude segments consisting of more than a single behaviour (mixed segments) in the training data of the machine learning model. Currently, the common practice is to ignore such segments during model training. In this paper we test the hypothesis that including mixed segments in model training improves accuracy, because the model would then perform better at identifying them in the test data. We test this hypothesis using a series of data simulations on four datasets of accelerometer data coupled with behaviour observations, obtained from four study species (Damaraland mole-rats, meerkats, olive baboons, polar bears). Results show that when a substantial proportion of the test data are mixed behaviour segments (above ~10%), including mixed segments in machine learning model training improves the accuracy of classification. These results were consistent across the four study species, and robust to changes in segment length, sample size, and degree of mixture within the mixed segments. However, we also find that in some cases (particularly in baboons) models trained with mixed segments show reduced accuracy in classifying test data containing only single behaviour (pure) segments, compared to models trained without mixed segments. Based on these results, we recommend that when the classification model is expected to deal with a substantial proportion of mixed behaviour segments (> 10%), it is beneficial to include them in model training; otherwise, including them is unnecessary but also not harmful.
The exception is when there is a basis to assume that the training data contain a higher rate of mixed segments than the actual (unobserved) data to be classified. Such a situation may occur particularly when training data are collected in captivity and used to classify data from the wild. In this case, excess inclusion of mixed segments in training data should probably be avoided.
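
The distinction between pure and mixed segments, and the choice of whether to keep mixed segments for training, can be illustrated with a minimal sketch. This is not the authors' code: the function names, the fixed-length non-overlapping segmentation, and the rule of labelling a mixed segment by its majority behaviour are all assumptions made here for illustration only.

```python
from collections import Counter


def segment_annotations(labels, seg_len):
    """Split per-sample behaviour labels into fixed-length segments.

    Hypothetical helper: each segment gets the majority behaviour as its
    label (an assumption, not necessarily the rule used in the paper),
    a flag saying whether it is mixed, and the majority fraction.
    """
    segments = []
    # Non-overlapping windows; a trailing partial window is dropped.
    for start in range(0, len(labels) - seg_len + 1, seg_len):
        window = labels[start:start + seg_len]
        counts = Counter(window)
        # On ties, most_common returns the behaviour seen first.
        majority, n_major = counts.most_common(1)[0]
        segments.append({
            "start": start,
            "label": majority,
            "mixed": len(counts) > 1,       # more than one behaviour
            "purity": n_major / seg_len,    # 1.0 for a pure segment
        })
    return segments


def select_training_segments(segments, include_mixed):
    """Keep all segments, or only the pure ones (the common practice)."""
    if include_mixed:
        return list(segments)
    return [s for s in segments if not s["mixed"]]


def mixed_fraction(segments):
    """Proportion of segments that contain more than one behaviour."""
    if not segments:
        return 0.0
    return sum(s["mixed"] for s in segments) / len(segments)


if __name__ == "__main__":
    # Toy annotation stream: 6 samples of rest, 4 of walk, 5 of dig.
    labels = ["rest"] * 6 + ["walk"] * 4 + ["dig"] * 5
    segs = segment_annotations(labels, seg_len=5)
    print(mixed_fraction(segs))
    print(len(select_training_segments(segs, include_mixed=False)))
```

Under these assumptions, `mixed_fraction` applied to data representative of what the model will classify gives the quantity to compare against the roughly 10% threshold stated above, and `include_mixed` switches between the two training regimes being compared.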