Department of Cartographic and Land Engineering, Higher Polytechnic School of Avila, University of Salamanca, Ávila, Spain.
Department of Geology, Facultad de Ciencia y Tecnología, Universidad del País Vasco-Euskal Herriko Unibertsitatea (UPV/EHU), Leioa, Spain.
Am J Biol Anthropol. 2023 Jul;181(3):454-473. doi: 10.1002/ajpa.24754. Epub 2023 May 17.
Data collection is a major hindrance in many types of analyses in human evolutionary studies. The issue is especially acute given the scarcity and quality of fossil data. As a result, many research projects are constrained by the amount of data available for tasks such as classification and predictive modeling.
Here we present the use of Monte Carlo based methods for the simulation of paleoanthropological data. Using two datasets containing cross-sectional biomechanical information and geometric morphometric 3D landmarks, we show how synthetic, yet realistic, data can be simulated to enhance each dataset and provide new information with which to perform complex tasks, in particular classification. We additionally present these algorithms in the form of an R library, AugmentationMC. We also use a geometric morphometric dataset to simulate 3D models, and emphasize the power of Machine Teaching, as opposed to Machine Learning.
Our results show that Monte Carlo based algorithms, such as Markov Chain Monte Carlo, are useful for the simulation of morphometric data, providing synthetic yet highly realistic data that have been statistically tested to be equivalent to the original data. We additionally provide a critical overview of bootstrapping techniques, showing that Monte Carlo based methods perform better than bootstrapping because the simulated data are not exact copies of the original sample.
While synthetic datasets should never replace large, real datasets, the approach presented here can be considered an important advance in how paleoanthropological data can be handled.
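The contrast drawn above between Markov Chain Monte Carlo simulation and bootstrapping can be illustrated with a minimal sketch. It is not the AugmentationMC implementation and does not use its API. It assumes a single measurement variable, a Gaussian target density fitted to the sample, and a random-walk Metropolis-Hastings sampler. The measurement values and the function names `metropolis_hastings` and `bootstrap` are hypothetical.

```python
import math
import random
import statistics


def metropolis_hastings(sample, n_draws, burn_in=500, seed=0):
    """Draw synthetic values with a random-walk Metropolis-Hastings sampler.

    Assumption: the target density is a normal distribution whose mean and
    standard deviation are estimated from the original sample.
    """
    rng = random.Random(seed)
    mu = statistics.mean(sample)
    sigma = statistics.stdev(sample)

    def log_target(x):
        # Log-density of the fitted normal, up to an additive constant.
        return -0.5 * ((x - mu) / sigma) ** 2

    current = mu
    draws = []
    for i in range(burn_in + n_draws):
        # Symmetric Gaussian proposal centred on the current state.
        proposal = current + rng.gauss(0.0, sigma)
        log_ratio = log_target(proposal) - log_target(current)
        # Accept with probability min(1, target(proposal) / target(current)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposal
        if i >= burn_in:
            draws.append(current)
    return draws


def bootstrap(sample, n_draws, seed=0):
    """Resample with replacement: every draw is a copy of an original value."""
    rng = random.Random(seed)
    return rng.choices(sample, k=n_draws)


# Hypothetical measurements, for illustration only.
original = [21.4, 23.1, 19.8, 22.5, 24.0, 20.7, 22.9, 21.8]

synthetic = metropolis_hastings(original, n_draws=2000)
resampled = bootstrap(original, n_draws=2000)

# Bootstrap draws never leave the set of observed values,
# whereas the MCMC draws include values not present in the sample.
print(len(set(resampled)), "distinct bootstrap values")
print(len(set(synthetic)), "distinct MCMC values")
```

In this sketch the bootstrap output can contain at most eight distinct values, one per original observation, while the sampler produces new values that follow the fitted distribution. A real application would need a target density suited to the data and a formal equivalence test between the synthetic and original samples.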