Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ.
Department of Herpetology, California Academy of Sciences, USA.
Mol Biol Evol. 2023 May 2;40(5). doi: 10.1093/molbev/msad109.
The data available for reconstructing molecular phylogenies have become wildly disparate. Phylogenomic studies can generate data for thousands of genetic markers for dozens of species, but for hundreds of other taxa, data may be available from only a few genes. Can these two types of data be integrated to combine the advantages of both, addressing the relationships of hundreds of species with thousands of genes? Here, we show that this is possible, using data from frogs. We generated a phylogenomic data set for 138 ingroup species and 3,784 nuclear markers (ultraconserved elements [UCEs]), including new UCE data from 70 species. We also assembled a supermatrix data set, including data from 97% of frog genera (441 total), with 1-307 genes per taxon. We then produced a combined phylogenomic-supermatrix data set (a "gigamatrix") containing 441 ingroup taxa and 4,091 markers but with 86% missing data overall. Likelihood analysis of the gigamatrix yielded a generally well-supported tree among families, largely consistent with trees from the phylogenomic data alone. All terminal taxa were placed in the expected families, even though 42.5% of these taxa each had >99.5% missing data and 70.2% had >90% missing data. Our results show that missing data need not be an impediment to successfully combining very large phylogenomic and supermatrix data sets, and they open the door to new studies that simultaneously maximize sampling of genes and taxa.
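The combination step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names (`concatenate`, `missing_fraction`), the toy marker names, and the choice to count `?`, `-`, and `N` as missing are all assumptions made for illustration. It shows how per-marker alignments with different taxon sampling might be joined into one matrix, padding absent taxa with `?`, and how the per-taxon proportion of missing data could then be computed.

```python
from typing import Dict, List

MISSING = "?"  # assumed placeholder for a taxon absent from a marker


def concatenate(markers: Dict[str, Dict[str, str]], taxa: List[str]) -> Dict[str, str]:
    """Join per-marker alignments into one matrix, padding absent taxa with '?'."""
    parts: Dict[str, List[str]] = {t: [] for t in taxa}
    for name, alignment in markers.items():
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"marker {name!r} is not a single aligned length")
        length = lengths.pop()
        for t in taxa:
            # A taxon with no data for this marker gets a run of '?'.
            parts[t].append(alignment.get(t, MISSING * length))
    return {t: "".join(p) for t, p in parts.items()}


def missing_fraction(seq: str) -> float:
    """Proportion of sites that are '?', '-', or 'N' (assumed definition of missing)."""
    if not seq:
        return 0.0
    return sum(c in "?-Nn" for c in seq) / len(seq)


# Hypothetical toy data: one phylogenomic marker sampled for two taxa,
# one supermatrix gene sampled for all three.
toy_markers = {
    "uce_1": {"A": "ACGT", "B": "ACGA"},
    "gene_1": {"A": "TT", "B": "TA", "C": "TT"},
}
toy_taxa = ["A", "B", "C"]

toy_matrix = concatenate(toy_markers, toy_taxa)
toy_missing = {t: missing_fraction(s) for t, s in toy_matrix.items()}
```

In this toy case taxon C has data only for the shorter gene, so most of its row is padding. That is the same situation, at a much smaller scale, as the taxa in the gigamatrix with more than 99.5% missing data.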