The use and validity of composite taxa in phylogenetic analysis.

Author information

Département de sciences biologiques, Université de Montréal, C.P. 6128, Succ. Centre-ville, Montréal, Québec H3C 3J7, Canada.

Publication information

Syst Biol. 2009 Dec;58(6):560-72. doi: 10.1093/sysbio/syp056. Epub 2009 Sep 21.

Abstract

In phylogenetic analysis, one possible approach to minimize missing data in DNA supermatrices consists in sampling sequences from different species to obtain a complete sequence for all genes included in the study. We refer to those complete sequences as composite taxa because DNA sequences that are combined belong to different species. An alternative approach is to analyze incomplete supermatrices by coding unavailable DNA sequences as missing. The accuracy of phylogenetic trees estimated using matrices that include composite taxa has recently been questioned, and the best approach for analyzing incomplete supermatrices is highly debated. Through computer simulations, we compared the phylogenetic accuracy of the 2 competing approaches. We explored the effect of composite taxa when inferring higher level relationships, that is, relationships between monophyletic groups. DNA sequences were simulated on a 42-taxon model tree and incomplete supermatrices containing different percentages of missing data were generated. These incomplete supermatrices were analyzed either by coding the missing data with "?" or by reducing the amount of missing data through the combination of 2 or more taxa to generate composite taxa. Of 180 comparisons (18 simulation cases with 2 different inference methods and 5 levels of incompleteness), we observed significantly higher phylogenetic accuracies for composite matrices in 46 comparisons, whereas missing data matrices outperformed composites in 8 comparisons. In all other cases, the phylogenetic accuracy obtained with composite matrices was not significantly different from that of missing data matrices. This study demonstrates that composite taxa represent an interesting approach to minimize the amount of missing data in supermatrices and we suggest that it is the optimal approach to use in phylogenomic studies to reduce computing time.
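The two strategies compared in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation code: the function names (`code_missing`, `make_composite`), the dictionary layout, and the toy sequences are all assumptions made for illustration. The sketch also assumes that the species being merged belong to the same monophyletic group, which is the setting the abstract describes.

```python
from typing import Dict, List, Optional

# Hypothetical layout: taxon -> gene -> aligned sequence, or None if unavailable.
Matrix = Dict[str, Dict[str, Optional[str]]]


def code_missing(matrix: Matrix, gene_lengths: Dict[str, int]) -> Dict[str, str]:
    """Missing-data approach: concatenate genes, padding absent ones with '?'."""
    rows = {}
    for taxon, genes in matrix.items():
        parts = []
        for gene, length in gene_lengths.items():
            seq = genes.get(gene)
            parts.append(seq if seq is not None else "?" * length)
        rows[taxon] = "".join(parts)
    return rows


def make_composite(matrix: Matrix, members: List[str],
                   gene_lengths: Dict[str, int]) -> str:
    """Composite approach: for each gene, take the first member species that
    has a sequence. Genes missing from every member are still coded as '?'."""
    parts = []
    for gene, length in gene_lengths.items():
        seq = next(
            (matrix[t][gene] for t in members if matrix[t].get(gene) is not None),
            None,
        )
        parts.append(seq if seq is not None else "?" * length)
    return "".join(parts)


# Toy data (made up): two species assumed to belong to the same clade.
gene_lengths = {"gene1": 4, "gene2": 4}
matrix = {
    "sp_A": {"gene1": "ACGT", "gene2": None},
    "sp_B": {"gene1": None, "gene2": "GGCC"},
}

missing_rows = code_missing(matrix, gene_lengths)
composite_row = make_composite(matrix, ["sp_A", "sp_B"], gene_lengths)
print(missing_rows)   # two rows, each half '?'
print(composite_row)  # one complete row built from both species
```

In this toy case the missing-data matrix keeps two rows that are each 50% `?`, whereas the composite matrix has a single complete row, which is the trade-off the simulations evaluate: fewer terminals and no missing cells, at the cost of mixing sequences from different species.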
