Marques Carolina S, Dufourq Emmanuel, Pereira Soraia, Santos Vanda F, Malafaia Elisabete
Centro de Estatística e Aplicações, Faculdade de Ciências, Universidade de Lisboa, Lisboa, Portugal.
African Institute for Mathematical Sciences, Muizenberg, South Africa.
PeerJ. 2025 Mar 26;13:e19116. doi: 10.7717/peerj.19116. eCollection 2025.
Classifying objects, such as the taxonomic identification of fossils from morphometric variables, is a time-consuming process. The task is further complicated by intra-class variability, which makes it a strong candidate for automation via machine learning (ML) techniques. In this study, we compared six ML techniques on datasets of morphometric features used to classify isolated theropod teeth at both the genus level and higher taxonomic levels. Our models also aim to differentiate teeth from different positions along the tooth row (e.g., lateral, mesial). These datasets pose several challenges, including over-representation of certain classes and missing measurements. Given the class imbalance, we evaluated the effect of different standardization and oversampling techniques on the classification performance of each model. The results show that some classification models are more sensitive to class imbalance than others. This study presents a novel comparative analysis of multi-class classification methods for theropod teeth, evaluating their performance across taxonomic levels and dataset balancing techniques. Its aim is to determine which ML methods are best suited to classifying isolated theropod teeth and to provide recommendations on handling imbalanced datasets using different standardization, oversampling, and classification tools. The trained models and applied standardizations are publicly available as a resource for future studies classifying isolated theropod teeth. This open-access methodology will enable more reliable cross-study comparisons of fossil records.
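As an illustration of the two preprocessing steps named in the abstract, the following minimal sketch applies a standardization and an oversampling step to a toy imbalanced dataset. The abstract does not state which specific techniques were compared, so z-score standardization and random oversampling are assumptions chosen for simplicity; the function names, class labels, and measurements are hypothetical and do not come from the study.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def random_oversample(rows, labels, seed=0):
    """Resample minority classes with replacement up to the majority count."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r, y in zip(rows, labels):
        by_class[y].append(r)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for r in members + extra:
            out_rows.append(r)
            out_labels.append(y)
    return out_rows, out_labels


# Hypothetical toy data: two measurements per tooth, three made-up classes
# with a 4 : 2 : 1 imbalance. These values are illustrative only.
X = [[10.0, 4.0], [12.0, 5.0], [11.0, 4.5], [13.0, 5.5],
     [30.0, 9.0], [32.0, 10.0], [55.0, 20.0]]
y = ["A", "A", "A", "A", "B", "B", "C"]

X_std = standardize(X)
X_bal, y_bal = random_oversample(X_std, y)
counts = Counter(y_bal)
print(counts)
```

In a real evaluation, both the standardization parameters and the oversampling would normally be fitted on the training split only, so that no information from the test set leaks into the model.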