Department of Organismic and Evolutionary Biology, Museum of Comparative Zoology, Harvard University, Cambridge, MA 02138, United States; Department of Biology, San Diego State University, San Diego, CA 92182, United States; Department of Evolution, Ecology, and Organismal Biology, University of California, Riverside, Riverside, CA 92521, United States.
Department of Biology, San Diego State University, San Diego, CA 92182, United States; Department of Entomology, University of California, Riverside, Riverside, CA 92521, United States.
Mol Phylogenet Evol. 2019 Oct;139:106562. doi: 10.1016/j.ympev.2019.106562. Epub 2019 Jul 16.
One major challenge to delimiting species with genetic data is successfully differentiating population structure from species-level divergence, an issue exacerbated in taxa inhabiting naturally fragmented habitats. Many fields of science are now using machine learning, and in evolutionary biology supervised machine learning has recently been used to infer species boundaries. These supervised methods require training data with associated labels. Conversely, unsupervised machine learning (UML) uses inherent data structure and does not require user-specified training labels, potentially providing more objectivity in species delimitation. In the context of integrative taxonomy, we demonstrate the utility of three UML approaches (random forests, variational autoencoders, t-distributed stochastic neighbor embedding) for species delimitation in an arachnid taxon with high population genetic structure (Opiliones, Laniatores, Metanonychus). We find that UML approaches successfully cluster samples according to species-level divergences rather than high levels of population structure, while model-based validation methods severely over-split putative species. UML offers intuitive data visualization in two-dimensional space and the ability to accommodate various data types, and it has potential in many areas of systematic and evolutionary biology. We argue that machine learning methods are ideally suited for species delimitation and may perform well in many natural systems and across taxa with diverse biological characteristics.
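The idea of clustering samples from inherent data structure, without training labels, can be illustrated with a toy sketch. This is not one of the three UML approaches named in the abstract: as a dependency-free stand-in it uses average-linkage hierarchical clustering on pairwise genotype distances and cuts the tree at the largest jump in merge height. The simulated data set, the function names (`simulate`, `upgma`, `cut_at_largest_gap`), and every parameter value are assumptions made purely for illustration.

```python
import random

def simulate(n_species=3, pops=2, n_per_pop=5, n_loci=300, flip=0.05, seed=1):
    """Toy diploid SNP data: species differ at about half of the loci,
    populations within a species differ at only a few (hypothetical values)."""
    rng = random.Random(seed)
    genotypes, labels = [], []
    for sp in range(n_species):
        base = [rng.random() < 0.5 for _ in range(n_loci)]  # species-level state
        for _ in range(pops):
            # Population structure: each population flips a small share of loci.
            state = [b != (rng.random() < flip) for b in base]
            for _ in range(n_per_pop):
                # Genotype = number of copies (0, 1, 2) of the focal allele.
                genotypes.append([
                    sum(rng.random() < (0.95 if s else 0.05) for _ in range(2))
                    for s in state
                ])
                labels.append(sp)  # kept only to check the result afterwards
    return genotypes, labels

def distance(a, b):
    """Mean absolute difference in allele counts across loci."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def upgma(genotypes):
    """Average-linkage clustering. Returns one (merge height, partition before
    that merge) pair per merge, in the order the merges happen."""
    n = len(genotypes)
    d = [[distance(genotypes[i], genotypes[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(d[a][b] for a in clusters[i] for b in clusters[j])
                h = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or h < best[0]:
                    best = (h, i, j)
        h, i, j = best
        history.append((h, [list(c) for c in clusters]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

def cut_at_largest_gap(history):
    """Return the partition that exists just before the biggest jump in merge height."""
    heights = [h for h, _ in history]
    m = max(range(1, len(heights)), key=lambda k: heights[k] - heights[k - 1])
    return history[m][1]

genotypes, species = simulate()
clusters = cut_at_largest_gap(upgma(genotypes))
for members in clusters:
    print(sorted(members), "simulated species:", {species[i] for i in members})
```

No labels enter the clustering; the simulated species identities are used only to inspect the result afterwards. In this toy setting the largest discontinuity in merge height falls between population-level and species-level divergence by construction, which real data need not show so cleanly.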