Effects of missing data on species tree estimation under the coalescent.

Author information

Department of Statistics, The Ohio State University, 404 Cockins Hall, 1958 Neil Avenue, Columbus, OH 43210, United States.

Publication information

Mol Phylogenet Evol. 2013 Dec;69(3):1057-62. doi: 10.1016/j.ympev.2013.06.004. Epub 2013 Jun 13.

DOI:10.1016/j.ympev.2013.06.004

PMID:23769751

Abstract

With recent advances in genomic sequencing, it has become evident that the processes that can cause discord between the speciation history and the individual gene histories must be taken into account. For multilocus datasets, it is difficult to achieve complete coverage of all sampled loci across all sampled specimens, a problem that also arises when combining incompletely overlapping datasets. Here we examine how missing data affect the accuracy of species tree reconstruction. In our study, 10- and 100-locus sequence datasets were simulated under the coalescent model from shallow and deep speciation histories, and species trees were estimated using the maximum likelihood and Bayesian frameworks (with STEM and *BEAST, respectively). The accuracy of the estimated species trees was evaluated using the symmetric difference and the SPR distance. We examine the effects of sampling more than one individual per species, as well as the effects of different patterns of missing data (i.e., different amounts of missing data, distributed among random taxa as opposed to being concentrated in specific taxa, as is often the case for empirical studies). Our general conclusion is that the species tree estimates are remarkably resilient to the effects of missing data. We find that for datasets with more limited numbers of loci, sampling more than one individual per species has the strongest effect on improving species tree accuracy when data are missing, especially at higher degrees of missing data. For larger multilocus datasets (e.g., 25-100 loci), the amount of missing data has a negligible effect on species tree reconstruction, even at 50% missing data and a single sampled individual per species.
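As an illustration of the symmetric difference mentioned above, the following minimal sketch counts the clades found in one rooted tree but not the other. It is not the authors' code: the nested-tuple tree representation and the function names `clades` and `symmetric_difference` are assumptions made for this example, and the abstract does not say which software was used to compute the distances.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names.

    Assumed representation (hypothetical, for illustration only): a tree is a
    nested tuple of leaf-name strings, e.g. ((("A", "B"), "C"), "D").
    """
    found = set()

    def leaves(node):
        # A string is a leaf; single leaves are not recorded as clades.
        if isinstance(node, str):
            return frozenset([node])
        # An internal node's clade is the union of its children's leaf sets.
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    # The root clade contains every taxon, so it is shared by all trees
    # on the same taxon set and carries no information.
    found.discard(all_leaves)
    return found


def symmetric_difference(tree_a, tree_b):
    """Number of clades present in exactly one of the two rooted trees."""
    return len(clades(tree_a) ^ clades(tree_b))


# Hypothetical four-taxon example: a "true" tree and an estimate that
# misplaces taxa B and C.
true_tree = ((("A", "B"), "C"), "D")
estimated_tree = ((("A", "C"), "B"), "D")
distance = symmetric_difference(true_tree, estimated_tree)
```

A distance of zero means the two trees share every clade; larger values indicate more topological disagreement between the estimated and the true species tree.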
