Effects of missing data on topological inference using a Total Evidence approach.

Author information

Guillerme Thomas, Cooper Natalie

Affiliations

School of Natural Sciences, Trinity College Dublin, Dublin 2, Ireland; Trinity Centre for Biodiversity Research, Trinity College Dublin, Dublin 2, Ireland.

School of Natural Sciences, Trinity College Dublin, Dublin 2, Ireland; Trinity Centre for Biodiversity Research, Trinity College Dublin, Dublin 2, Ireland; Department of Life Sciences, Natural History Museum, Cromwell Road, London SW7 5BD, UK.

Publication information

Mol Phylogenet Evol. 2016 Jan;94(Pt A):146-58. doi: 10.1016/j.ympev.2015.08.023. Epub 2015 Aug 31.

DOI:10.1016/j.ympev.2015.08.023

PMID:26335040

Abstract

To fully understand macroevolutionary patterns and processes, we need to include both extant and extinct species in our models. This requires phylogenetic trees with both living and fossil taxa at the tips. One way to infer such phylogenies is the Total Evidence approach, which uses molecular data from living taxa and morphological data from living and fossil taxa. Although the Total Evidence approach is very promising, it requires a great deal of data that can be hard to collect. Therefore, this method is likely to suffer from missing data issues that may affect its ability to infer correct phylogenies. Here we use simulations to assess the effects of missing data on tree topologies inferred from Total Evidence matrices. We investigate three major factors that directly affect the completeness and the size of the morphological part of the matrix: the proportion of living taxa with no morphological data, the amount of missing data in the fossil record, and the overall number of morphological characters in the matrix. We infer phylogenies from complete matrices and from matrices with various amounts of missing data, and then compare the missing-data topologies to the "best" tree topology inferred using the complete matrix. We find that the number of living taxa with morphological characters and the overall number of morphological characters in the matrix are more important than the amount of missing data in the fossil record for recovering the "best" tree topology. Therefore, we suggest that sampling effort should be focused on morphological data collection for living species to increase the accuracy of topological inference in a Total Evidence framework. Additionally, we find that Bayesian methods consistently outperform other tree inference methods. We therefore recommend using Bayesian consensus trees to fix the tree topology prior to further analyses.
