Filipski Alan, Murillo Oscar, Freydenzon Anna, Tamura Koichiro, Kumar Sudhir
Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University.
Center for Evolutionary Medicine and Informatics, Biodesign Institute, Arizona State University; School of Life Sciences, Arizona State University.
Mol Biol Evol. 2014 Sep;31(9):2542-50. doi: 10.1093/molbev/msu200. Epub 2014 Jun 27.
Scientists are assembling sequence data sets from increasing numbers of species and genes to build comprehensive timetrees. However, data are often unavailable for some species and gene combinations, and the proportion of missing data is often large for data sets containing many genes and species. Surprisingly, there has not been a systematic analysis of the effect of the degree of sparseness of the species-gene matrix on the accuracy of divergence time estimates. Here, we present results from computer simulations and empirical data analyses to quantify the impact of missing gene data on divergence time estimation in large phylogenies. We found that estimates of divergence times were robust even when sequences from a majority of genes for most of the species were absent. From the analysis of such extremely sparse data sets, we found that the most egregious errors occurred for nodes in the tree that had no common genes for any pair of species in the immediate descendant clades of the node in question. These problematic nodes can be easily detected prior to computational analyses based only on the input sequence alignment and the tree topology. We conclude that it is best to use larger alignments, because adding both genes and species to the alignment augments the number of genes available for estimating divergence events deep in the tree and improves their time estimates.
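The pre-analysis check described in the abstract (flagging nodes whose immediate descendant clades have no gene in common for any pair of species) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `find_unlinked_nodes`, the nested-tuple tree encoding, and the `genes_by_species` mapping are all hypothetical choices made for illustration. The sketch relies on one observation: some pair of species drawn from two clades shares a gene exactly when the union of genes sampled in one clade intersects the union sampled in the other. For nodes with more than two children, it assumes a node should be flagged if any two of its child clades share no gene.

```python
from itertools import combinations


def find_unlinked_nodes(tree, genes_by_species):
    """Return the leaf sets of internal nodes whose child clades share no gene.

    tree: nested tuples with species names (str) as leaves (hypothetical encoding).
    genes_by_species: dict mapping species name -> set of genes with sequence data.
    """
    flagged = []

    def visit(node):
        # Returns (species under this node, union of genes sampled under it).
        if isinstance(node, str):
            return frozenset([node]), set(genes_by_species.get(node, ()))
        children = [visit(child) for child in node]
        leaves = frozenset().union(*(child_leaves for child_leaves, _ in children))
        genes = set().union(*(child_genes for _, child_genes in children))
        # Flag the node if some pair of child clades has no gene in common.
        for (_, genes_a), (_, genes_b) in combinations(children, 2):
            if not (genes_a & genes_b):
                flagged.append(leaves)
                break
        return leaves, genes

    visit(tree)
    return flagged


# Toy example (made-up species and genes, for illustration only).
tree = ((("A", "B"), "C"), ("D", "E"))
genes_by_species = {
    "A": {"g1"},
    "B": {"g2"},
    "C": {"g1", "g2"},
    "D": {"g3"},
    "E": {"g3"},
}

for clade in find_unlinked_nodes(tree, genes_by_species):
    print(sorted(clade))
```

In this toy input, the node joining A and B is flagged because the two species have no gene in common, and the root is flagged because the clade (A, B, C) and the clade (D, E) were sampled for disjoint gene sets. The check needs only the presence/absence matrix and the topology, consistent with the abstract's statement that such nodes can be detected before any computational analysis.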